%matplotlib inline
import pycisTopic
pycisTopic.__version__
'0.1.dev279+g6a433f6.d20210805'
You can easily install pycisTopic from its github repository.
module load Python/3.7.4-GCCcore-6.4.0
git clone https://github.com/aertslab/pycisTopic.git
cd pycisTopic
pip install .
import pycisTopic
pycisTopic.__version__
'0.1.dev269+g9b7247b.d20210726'
You can also check function documentation for further information.
from pycisTopic.qc import *
help(profile_tss)
We will start by loading the barcode metadata from the scRNA-seq analysis. In this case, I will use the annotated loom file, but loading it from a tsv file or similar is also possible (as long as you end up with a similar pd.DataFrame).
# Get metadata from loom file
from loomxpy.loomxpy import SCopeLoom
from pycisTopic.loom import *
path_to_annotated_rna_loom = "/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/add_clusters_scrublet_as_annot/10x_multiome_brain_SCENIC_SCope_output_wAnnot_noDBL-2.loom"
loom = SCopeLoom.read_loom(path_to_annotated_rna_loom)
cell_data = get_metadata(loom)
Importantly, if you are starting from a pandas data frame, the index of your pandas should correspond to BARCODE (e.g. ATGTCTGATAGA-1, additional tags are possible using -; e.g. ATGTCTGATAGA-1-sample_1) and it must contain a 'sample_id' column indicating the sample label fo origin. Let's see how it looks like here. The sample_id for all cells in this tutorial is '10x_multiome_brain'. Alternatively, you can also add a column named 'barcode' to the metadata with the corresponding cell barcodes (in this case the name of the cells will not be used to infer the barcode id).
cell_data
| leiden_res0.9 | leiden_res0.3 | leiden_res1.2 | leiden_res0.6 | cell_type | sample_id | |
|---|---|---|---|---|---|---|
| AAACAGCCATTATGCG-1-10x_multiome_brain | MOL_B_1 (1) | MOL_B (0) | MOL_B_3 (6) | MOL_B_1 (0) | MOL_B | 10x_multiome_brain |
| AAACCAACATAGACCC-1-10x_multiome_brain | MOL_B_3 (5) | MOL_B (0) | MOL_B_4 (4) | MOL_B_1 (0) | MOL_B | 10x_multiome_brain |
| AAACCGAAGATGCCTG-1-10x_multiome_brain | INH_VIP (8) | INH_VIP (6) | INH_VIP (10) | INH_VIP (8) | INH_VIP | 10x_multiome_brain |
| AAACCGAAGTTAGCTA-1-10x_multiome_brain | MOL_A_1 (0) | MOL_A (1) | MOL_A_2 (0) | MOL_A_2 (1) | MOL_A | 10x_multiome_brain |
| AAACCGCGTCTTACTA-1-10x_multiome_brain | MOL_B_1 (1) | MOL_B (0) | MOL_B_2 (2) | MOL_B_2 (3) | MOL_B | 10x_multiome_brain |
| ... | ... | ... | ... | ... | ... | ... |
| TTTGTGAAGGGTGAGT-1-10x_multiome_brain | INH_VIP (8) | INH_VIP (6) | INH_VIP (10) | INH_VIP (8) | INH_VIP | 10x_multiome_brain |
| TTTGTGAAGTCAGGCC-1-10x_multiome_brain | AST_CER (2) | AST_CER (2) | AST_CER_1 (7) | AST_CER (2) | AST_CER | 10x_multiome_brain |
| TTTGTGGCATGCTTAG-1-10x_multiome_brain | MOL_B_1 (1) | MOL_B (0) | MOL_B_1 (1) | MOL_B_1 (0) | MOL_B | 10x_multiome_brain |
| TTTGTTGGTGATCAGC-1-10x_multiome_brain | MOL_A_1 (0) | MOL_A (1) | MOL_A_1 (11) | MOL_A_2 (1) | MOL_A | 10x_multiome_brain |
| TTTGTTGGTGATTTGG-1-10x_multiome_brain | INH_SST (7) | INH_SST (5) | INH_SST (9) | INH_SST (7) | INH_SST | 10x_multiome_brain |
2607 rows × 6 columns
In order to produce the bigwig files, we also need to know the overall size of the chromosomes. We can easily download this information from the UCSC.
# Get chromosome sizes (for hg38 here)
import pyranges as pr
import requests
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
# Exceptionally in this case, to agree with CellRangerARC annotations
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)
Now we have all the ingredients we need to generate the pseudobulk files. With this function we will generate fragments files per group and the corresponding bigwigs. The mandatory input to this function are:
input_data)The output will be two dictionaries containing the paths to the bed and bigwig files, respectively, to each group.
from pycisTopic.pseudobulk_peak_calling import *
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
variable = 'cell_type',
chromsizes = chromsizes,
bed_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/',
bigwig_path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bw_files/',
path_to_fragments = {'10x_multiome_brain':'/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/data/cell-arc/1.0.0/outs/human_brain_3k_atac_fragments.tsv.gz'},
n_cpu = 5,
normalize_bigwig = True,
remove_duplicates = True,
_temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')
2021-08-05 17:14:48,563 cisTopic INFO Reading fragments from /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/data/cell-arc/1.0.0/outs/human_brain_3k_atac_fragments.tsv.gz
2021-08-05 17:18:14,108 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=13819) 2021-08-05 17:18:16,966 cisTopic INFO Creating pseudobulk for AST (pid=13818) 2021-08-05 17:18:17,588 cisTopic INFO Creating pseudobulk for ASTP (pid=13817) 2021-08-05 17:18:18,244 cisTopic INFO Creating pseudobulk for AST_CER (pid=13820) 2021-08-05 17:18:18,909 cisTopic INFO Creating pseudobulk for ENDO (pid=13816) 2021-08-05 17:18:19,670 cisTopic INFO Creating pseudobulk for GC (pid=13820) 2021-08-05 17:18:28,780 cisTopic INFO ENDO done! (pid=13820) 2021-08-05 17:18:28,925 cisTopic INFO Creating pseudobulk for GP (pid=13819) 2021-08-05 17:18:33,541 cisTopic INFO AST done! (pid=13819) 2021-08-05 17:18:33,686 cisTopic INFO Creating pseudobulk for INH_PVALB (pid=13819) 2021-08-05 17:18:50,231 cisTopic INFO INH_PVALB done! (pid=13819) 2021-08-05 17:18:50,377 cisTopic INFO Creating pseudobulk for INH_SST (pid=13818) 2021-08-05 17:18:55,853 cisTopic INFO ASTP done! (pid=13818) 2021-08-05 17:18:56,006 cisTopic INFO Creating pseudobulk for INH_VIP (pid=13820) 2021-08-05 17:19:24,706 cisTopic INFO GP done! (pid=13820) 2021-08-05 17:19:24,851 cisTopic INFO Creating pseudobulk for MGL (pid=13820) 2021-08-05 17:19:45,535 cisTopic INFO MGL done! (pid=13820) 2021-08-05 17:19:45,673 cisTopic INFO Creating pseudobulk for MOL_A (pid=13818) 2021-08-05 17:20:25,281 cisTopic INFO INH_VIP done! (pid=13818) 2021-08-05 17:20:25,430 cisTopic INFO Creating pseudobulk for MOL_B (pid=13819) 2021-08-05 17:20:49,405 cisTopic INFO INH_SST done! (pid=13819) 2021-08-05 17:20:49,579 cisTopic INFO Creating pseudobulk for OPC (pid=13816) 2021-08-05 17:21:09,663 cisTopic INFO GC done! (pid=13816) 2021-08-05 17:21:09,847 cisTopic INFO Creating pseudobulk for PURK (pid=13817) 2021-08-05 17:21:27,230 cisTopic INFO AST_CER done! (pid=13816) 2021-08-05 17:21:39,640 cisTopic INFO PURK done! (pid=13819) 2021-08-05 17:22:00,242 cisTopic INFO OPC done! (pid=13820) 2021-08-05 17:22:05,518 cisTopic INFO MOL_A done!
Let's save the paths dictionaries.
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl', 'wb') as f:
pickle.dump(bed_paths, f)
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl', 'wb') as f:
pickle.dump(bw_paths, f)
This function can be used with cisTopic objects (as input_data), instead of the annotation data frame and the paths to fragments dictionary. You can find an example later.
In the following step, we will use MACS2 to call peaks in each group (in this case, cell type). The default parameters are those recommended for ATAC-seq data.
# Load bed paths
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl', 'rb')
bed_paths = pickle.load(infile)
infile.close()
import os
from pycisTopic.pseudobulk_peak_calling import *
macs_path='macs2'
outdir = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/'
# Run peak calling
narrow_peaks_dict = peak_calling(macs_path,
bed_paths,
outdir,
genome_size='hs',
n_cpu=5,
input_format='BEDPE',
shift=73,
ext_size=146,
keep_dup = 'all',
q_value = 0.05,
_temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')
2021-08-05 17:24:38,827 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=17789) 2021-08-05 17:24:41,800 cisTopic INFO Calling peaks for AST_CER with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/AST_CER.bed.gz --name AST_CER --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17785) 2021-08-05 17:24:41,800 cisTopic INFO Calling peaks for GC with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/GC.bed.gz --name GC --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17786) 2021-08-05 17:24:41,953 cisTopic INFO Calling peaks for ASTP with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/ASTP.bed.gz --name ASTP --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17788) 2021-08-05 17:24:41,891 cisTopic INFO Calling peaks for AST with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/AST.bed.gz --name AST --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17787) 2021-08-05 17:24:41,979 cisTopic INFO Calling peaks for ENDO with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/ENDO.bed.gz --name ENDO --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17787) 2021-08-05 17:24:50,195 cisTopic INFO ENDO done! (pid=17787) 2021-08-05 17:24:50,207 cisTopic INFO Calling peaks for GP with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/GP.bed.gz --name GP --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17788) 2021-08-05 17:25:05,534 cisTopic INFO AST done! (pid=17788) 2021-08-05 17:25:05,556 cisTopic INFO Calling peaks for INH_PVALB with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/INH_PVALB.bed.gz --name INH_PVALB --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17788) 2021-08-05 17:25:13,229 cisTopic INFO INH_PVALB done! (pid=17788) 2021-08-05 17:25:13,240 cisTopic INFO Calling peaks for INH_SST with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/INH_SST.bed.gz --name INH_SST --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17786) 2021-08-05 17:25:13,372 cisTopic INFO ASTP done! (pid=17786) 2021-08-05 17:25:13,395 cisTopic INFO Calling peaks for INH_VIP with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/INH_VIP.bed.gz --name INH_VIP --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17787) 2021-08-05 17:25:41,290 cisTopic INFO GP done! (pid=17787) 2021-08-05 17:25:41,325 cisTopic INFO Calling peaks for MGL with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/MGL.bed.gz --name MGL --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17786) 2021-08-05 17:26:13,218 cisTopic INFO INH_VIP done! (pid=17786) 2021-08-05 17:26:13,250 cisTopic INFO Calling peaks for MOL_A with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/MOL_A.bed.gz --name MOL_A --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17787) 2021-08-05 17:26:15,017 cisTopic INFO MGL done! (pid=17787) 2021-08-05 17:26:15,041 cisTopic INFO Calling peaks for MOL_B with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/MOL_B.bed.gz --name MOL_B --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17785) 2021-08-05 17:26:17,061 cisTopic INFO GC done! (pid=17785) 2021-08-05 17:26:17,113 cisTopic INFO Calling peaks for OPC with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/OPC.bed.gz --name OPC --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17788) 2021-08-05 17:26:23,600 cisTopic INFO INH_SST done! (pid=17788) 2021-08-05 17:26:23,648 cisTopic INFO Calling peaks for PURK with macs2 callpeak --treatment /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/PURK.bed.gz --name PURK --outdir /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda (pid=17789) 2021-08-05 17:26:27,725 cisTopic INFO AST_CER done! (pid=17788) 2021-08-05 17:26:45,543 cisTopic INFO PURK done! (pid=17785) 2021-08-05 17:27:15,840 cisTopic INFO OPC done! (pid=17786) 2021-08-05 17:27:27,155 cisTopic INFO MOL_A done! (pid=17787) 2021-08-05 17:27:42,793 cisTopic INFO MOL_B done!
Let's save the narrow peaks dictionary (with a PyRanges with the narrow peaks for each cell type).
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/MACS/narrow_peaks_dict.pkl', 'wb') as f:
pickle.dump(narrow_peaks_dict, f)
Finally, it is time to derive the consensus peaks. To do so, we use the TGCA iterative peak filtering approach. First, each summit is extended a peak_half_width in each direction and then we iteratively filter out less significant peaks that overlap with a more significant one. During this procedure peaks will be merged and depending on the number of peaks included into them, different processes will happen:
This proccess will happen twice, first in each pseudobulk peaks; and after peak score normalization, to process all peaks together.
# Get chromosome sizes (for hg38 here). We need them to ensure that extending the summits we don't fall out of the chromosome.
import pyranges as pr
import requests
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
# Exceptionally in this case, to agree with CellRangerARC annotations
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)
from pycisTopic.iterative_peak_calling import *
# Other param
peak_half_width=250
path_to_blacklist='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/pycisTopic/blacklist/hg38-blacklist.v2.bed'
# Get consensus peaks
consensus_peaks=get_consensus_peaks(narrow_peaks_dict, peak_half_width, chromsizes=chromsizes, path_to_blacklist=path_to_blacklist)
2021-08-05 17:28:38,316 cisTopic INFO Extending and merging peaks per class 2021-08-05 17:30:00,123 cisTopic INFO Normalizing peak scores 2021-08-05 17:30:00,662 cisTopic INFO Merging peaks Warning! Start and End columns now have different dtypes: int64 and int32 2021-08-05 17:31:47,110 cisTopic INFO Done!
# Write to bed
consensus_peaks.to_bed(path='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.bed', keep=True, compression='infer', chain=False)
The next step is to perform QC in the scATAC-seq samples (in this case, only one run). There are several measurements and visualizations performed in this step:
To calculate the TSS enrichment we need to provide TSS annotations. You can easily download them via pybiomart.
# Get TSS annotations
import pybiomart as pbm
# For mouse (mm39)
#dataset = pbm.Dataset(name='mmusculus_gene_ensembl', host='http://www.ensembl.org')
# For mouse (mm10)
#dataset = pbm.Dataset(name='mmusculus_gene_ensembl', host='http://nov2020.archive.ensembl.org/')
# For human (hg38)
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
# For human (hg19)
#dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://grch37.ensembl.org/')
# For fly (dm6)
# dataset = pbm.Dataset(name='dmelanogaster_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
/tmp/ipykernel_10364/276247402.py:16: FutureWarning: The default value of regex will change from True to False in a future version. annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
If you want to run all (or several of) the metrics, you can use the compute_qc_stats() function. As input you need to provide a dictionary containing the fragments files per sample and another dictionary the corresponding regions to use to estimate the FRIP. For more details in the QC stats, see the QC advanced tutorial.
from pycisTopic.qc import *
fragments_dict = {'10x_multiome_brain':'/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/data/cell-arc/1.0.0/outs/human_brain_3k_atac_fragments.tsv.gz'}
path_to_regions = {'10x_multiome_brain':'/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.bed'}
metadata_bc, profile_data_dict = compute_qc_stats(fragments_dict = fragments_dict,
tss_annotation = annot,
stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
label_list = None,
path_to_regions = path_to_regions,
n_cpu = 1,
valid_bc = None,
n_frag = 100,
n_bc = None,
tss_flank_window = 1000,
tss_window = 50,
tss_minimum_signal_window = 100,
tss_rolling_window = 10,
remove_duplicates = True,
_temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')
2021-08-05 17:35:03,056 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=22662) 2021-08-05 17:35:06,014 cisTopic INFO Reading 10x_multiome_brain (pid=22662) 2021-08-05 17:38:19,960 cisTopic INFO Computing barcode rank plot for 10x_multiome_brain (pid=22662) 2021-08-05 17:38:19,961 cisTopic INFO Counting fragments (pid=22662) 2021-08-05 17:38:29,726 cisTopic INFO Marking barcodes with more than 100 (pid=22662) 2021-08-05 17:38:29,795 cisTopic INFO Returning plot data (pid=22662) 2021-08-05 17:38:29,797 cisTopic INFO Returning valid barcodes (pid=22662) 2021-08-05 17:38:42,597 cisTopic INFO Computing duplicate rate plot for 10x_multiome_brain (pid=22662) 2021-08-05 17:38:55,732 cisTopic INFO Return plot data (pid=22662) 2021-08-05 17:38:55,809 cisTopic INFO Computing insert size distribution for 10x_multiome_brain (pid=22662) 2021-08-05 17:38:55,810 cisTopic INFO Counting fragments (pid=22662) 2021-08-05 17:39:02,662 cisTopic INFO Returning plot data (pid=22662) 2021-08-05 17:40:19,309 cisTopic INFO Computing TSS profile for 10x_multiome_brain (pid=22662) 2021-08-05 17:40:30,467 cisTopic INFO Formatting annnotation (pid=22662) 2021-08-05 17:40:30,543 cisTopic INFO Creating coverage matrix (pid=22662) 2021-08-05 17:43:05,582 cisTopic INFO Coverage matrix done (pid=22662) 2021-08-05 17:43:15,512 cisTopic INFO Returning normalized TSS coverage matrix per barcode (pid=22662) 2021-08-05 17:43:22,843 cisTopic INFO Returning normalized sample TSS enrichment data (pid=22662) 2021-08-05 17:43:22,968 cisTopic INFO Computing FRIP profile for 10x_multiome_brain (pid=22662) 2021-08-05 17:43:23,734 cisTopic INFO Counting fragments (pid=22662) 2021-08-05 17:43:53,664 cisTopic INFO Intersecting fragments with regions (pid=22662) 2021-08-05 17:44:46,129 cisTopic INFO Sample 10x_multiome_brain done!
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/metadata_bc.pkl', 'wb') as f:
pickle.dump(metadata_bc, f)
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/profile_data_dict.pkl', 'wb') as f:
pickle.dump(profile_data_dict, f)
Once the QC metrics have been computed you can visualize the results at the sample-level and the barcode-level. Sample-level statistics can be used to assess the overall quality of the sample, while barcode level statistics can be use to differentiate good quality cells versus the rest. The sample-level graphs include:
Barcode rank plot: The barcode rank plot shows the distribution of non-duplicate reads and which barcodes were inferred to be associated with cells. A steep drop-off ('knee') is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty partitions.
Insertion size: ATAC-seq requires a proper pair of Tn5 transposase cutting events at the ends of DNA. In the nucleosome-free open chromatin regions, many molecules of Tn5 can kick in and chop the DNA into small pieces; around nucleosome-occupied regions, and Tn5 can only access the linker regions. Therefore, in a good ATAC-seq library, you should expect to see a sharp peak at the <100 bp region (open chromatin), and a peak at ~200bp region (mono-nucleosome), and other larger peaks (multi-nucleosomes). A clear nucleosome pattern indicates a good quality of the experiment.
Sample TSS enrichment: The TSS enrichment calculation is a signal to noise calculation. The reads around a reference set of TSSs are collected to form an aggregate distribution of reads centered on the TSSs and extending to 1000 bp in either direction (for a total of 2000bp). This distribution is then normalized by taking the average read depth in the 100 bps at each of the end flanks of the distribution (for a total of 200bp of averaged data) and calculating a fold change at each position over that average read depth. This means that the flanks should start at 1, and if there is high read signal at transcription start sites (highly open regions of the genome) there should be an increase in signal up to a peak in the middle.
FRIP distribution: Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. A low FRIP indicates that many reads form part of the background, and so that the data is noisy.
Duplication rate: A fragment is considered “usable” if it uniquely maps to the genome and remains after removing PCR duplicates (defined as two fragments that map to the same genomic position and have the same unique molecular identifier). The duplication rate serves to estimate the amount of usable reads per barcode. High duplication rates may indicate over-sequencing or lack of fragments after transposition and encapsulation.
# Load sample metrics
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/profile_data_dict.pkl', 'rb')
profile_data_dict = pickle.load(infile)
infile.close()
from pycisTopic.qc import *
plot_sample_metrics(profile_data_dict,
insert_size_distriubtion_xlim=[0,600],
ncol=5,
plot=True,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/sample_metrics.pdf')
/opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots). warnings.warn(msg, FutureWarning)
Barcode-level statistics can be used to select high quality cells. Typical measurements that can be used are:
# Load barcode metrics
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/metadata_bc.pkl', 'rb')
metadata_bc = pickle.load(infile)
infile.close()
# Return figure to plot together with other metrics, and cells passing filters. Figure will be saved as pdf.
from pycisTopic.qc import *
FRIP_NR_FRAG_fig, FRIP_NR_FRAG_filter=plot_barcode_metrics(metadata_bc['10x_multiome_brain'],
var_x='Log_unique_nr_frag',
var_y='FRIP',
min_x=3.5,
max_x=None,
min_y=0.2,
max_y=None,
return_cells=True,
return_fig=True,
plot=False,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/barcode_metrics/FRIP-VS-NRFRAG.pdf')
# Return figure to plot together with other metrics, and cells passing filters
TSS_NR_FRAG_fig, TSS_NR_FRAG_filter=plot_barcode_metrics(metadata_bc['10x_multiome_brain'],
var_x='Log_unique_nr_frag',
var_y='TSS_enrichment',
min_x=3.5,
max_x=None,
min_y=4,
max_y=None,
return_cells=True,
return_fig=True,
plot=False,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/barcode_metrics/TSS-VS-NRFRAG.pdf')
# Return figure to plot together with other metrics, but not returning cells (no filter applied for the duplication rate per barcode)
DR_NR_FRAG_fig=plot_barcode_metrics(metadata_bc['10x_multiome_brain'],
var_x='Log_unique_nr_frag',
var_y='Dupl_rate',
min_x=3.5,
max_x=None,
min_y=None,
max_y=None,
return_cells=False,
return_fig=True,
plot=False,
plot_as_hexbin = True)
/opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:1647: FutureWarning: The `vertical` parameter is deprecated and will be removed in a future version. Assign the data to the `y` variable instead. warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:1647: FutureWarning: The `vertical` parameter is deprecated and will be removed in a future version. Assign the data to the `y` variable instead. warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning) /opt/venv/lib/python3.8/site-packages/seaborn/distributions.py:1647: FutureWarning: The `vertical` parameter is deprecated and will be removed in a future version. Assign the data to the `y` variable instead. warnings.warn(msg, FutureWarning)
# Plot barcode stats in one figure
fig=plt.figure(figsize=(40,10))
plt.subplot(1, 3, 1)
img = fig2img(FRIP_NR_FRAG_fig) #To convert figures to png to plot together, see .utils.py. This converts the figure to png.
plt.imshow(img)
plt.axis('off')
plt.subplot(1, 3, 2)
img = fig2img(TSS_NR_FRAG_fig)
plt.imshow(img)
plt.axis('off')
plt.subplot(1, 3, 3)
img = fig2img(DR_NR_FRAG_fig)
plt.imshow(img)
plt.axis('off')
plt.show()
We select now the cells passing filters:
bc_passing_filters = {'10x_multiome_brain':[]}
bc_passing_filters['10x_multiome_brain'] = list((set(FRIP_NR_FRAG_filter) & set(TSS_NR_FRAG_filter)))
We have a total of 2,925 selected barcodes.
len(bc_passing_filters['10x_multiome_brain'])
2925
Of these, a total of 2,240 overlaps with high quality scRNA-seq barcodes. We will keep the additional barcodes for now.
# Get metadata from high-quality loom file
from pycisTopic.loom import *
path_to_annotated_rna_loom = "/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/add_clusters_scrublet_as_annot/10x_multiome_brain_SCENIC_SCope_output_wAnnot_noDBL.loom"
loom = SCopeLoom.read_loom(path_to_annotated_rna_loom)
cell_data = get_metadata(loom)
scRNA_bc=[re.sub("-10x_multiome_brain", "", x) for x in cell_data.index.tolist()]
len(list(set(bc_passing_filters['10x_multiome_brain']) & set(scRNA_bc)))
2240
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/bc_passing_filters.pkl', 'wb') as f:
pickle.dump(bc_passing_filters, f)
In this step a fragments count matrix will be generated, in which the fragments in each region for each barcode is indicated. If you would like to start from a precomputed fragments count matrix it is also possible (see advanced tutorial on Initializing cisTopic objects). For multiple samples, you can add additional entries in fragment_dict. As regions, we will use the consensus peaks derived from the scRNA-seq annotations. This cisTopic object will contain:
# Path to fragments
fragments_dict = {'10x_multiome_brain':'/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/data/cell-arc/1.0.0/outs/human_brain_3k_atac_fragments.tsv.gz'}
# Path to regions
path_to_regions = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.bed'
# Blacklist
path_to_blacklist='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/pycisTopic/blacklist/hg38-blacklist.v2.bed'
# Metrics
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/metadata_bc.pkl', 'rb')
metadata_bc = pickle.load(infile)
infile.close()
# Valid barcodes
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/quality_control/bc_passing_filters.pkl', 'rb')
bc_passing_filters = pickle.load(infile)
infile.close()
#Create objects
cistopic_obj_list=[create_cistopic_object_from_fragments(path_to_fragments=fragments_dict[key],
path_to_regions=path_to_regions,
path_to_blacklist=path_to_blacklist,
metrics=metadata_bc[key],
valid_bc=bc_passing_filters[key],
n_cpu=1,
project=key) for key in fragments_dict.keys()]
2021-08-05 17:47:47,022 cisTopic INFO Reading data for 10x_multiome_brain 2021-08-05 17:50:50,249 cisTopic INFO metrics provided! 2021-08-05 17:51:01,194 cisTopic INFO valid_bc provided, selecting barcodes! 2021-08-05 17:51:11,552 cisTopic INFO Counting fragments in regions 2021-08-05 17:52:06,407 cisTopic INFO Creating fragment matrix 2021-08-05 17:53:42,874 cisTopic INFO Converting fragment matrix to sparse matrix 2021-08-05 17:54:00,911 cisTopic INFO Removing blacklisted regions 2021-08-05 17:54:02,756 cisTopic INFO Creating CistopicObject 2021-08-05 17:54:06,022 cisTopic INFO Done!
In this case we only have one sample, so only one cisTopic object has been generated. If you would have multiple samples, you would need to run the merge() function in your cisTopic object list. You can find more information in the advanced tutorial on Initializing cisTopic objects.
cistopic_obj = cistopic_obj_list[0]
print(cistopic_obj)
CistopicObject from project 10x_multiome_brain with n_cells × n_regions = 2925 × 435824
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject.pkl', 'wb') as f:
pickle.dump(cistopic_obj, f)
We can add additional metadata (for regions or cells) to a cisTopic object. For example, let's add the scRNA-seq data annotations. For those barcodes that did not pass the scRNA-seq values will be filled with Nan.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Get metadata from loom file
from pycisTopic.loom import *
path_to_annotated_rna_loom = "/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/add_clusters_scrublet_as_annot/10x_multiome_brain_SCENIC_SCope_output_wAnnot_noDBL-2.loom"
loom = SCopeLoom.read_loom(path_to_annotated_rna_loom)
cell_data = get_metadata(loom).drop('sample_id', axis=1)
The indexes in the pandas data frame to add can be cell barcodes (if the cisTopic object has been created from a fragments file only) or an exact match with the cell names in the cisTopic object (cistopic_obj.cell_names).
cistopic_obj.add_cell_data(cell_data)
Warning: Some cells in this CistopicObject are not present in this cell_data. Values will be filled with Nan
cistopic_obj.cell_data
| cisTopic_nr_frag | cisTopic_log_nr_frag | cisTopic_nr_acc | cisTopic_log_nr_acc | sample_id | Log_total_nr_frag | Log_unique_nr_frag | Total_nr_frag | Unique_nr_frag | Dupl_nr_frag | ... | Total_nr_frag_in_regions | Unique_nr_frag_in_regions | FRIP | TSS_enrichment | barcode | leiden_res0.9 | leiden_res0.3 | leiden_res1.2 | leiden_res0.6 | cell_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CACCTCAGTTGTAAAC-1-10x_multiome_brain | 18291 | 4.262237 | 15553 | 4.191814 | 10x_multiome_brain | 4.680707 | 4.477873 | 47941 | 30052 | 17889 | ... | 28449 | 17246 | 0.573872 | 6.397778 | CACCTCAGTTGTAAAC-1 | NaN | NaN | NaN | NaN | NaN |
| TGACTCCTCATCCACC-1-10x_multiome_brain | 100115 | 5.000499 | 62314 | 4.794586 | 10x_multiome_brain | 5.493496 | 5.250254 | 311527 | 177932 | 133595 | ... | 174479 | 95349 | 0.535873 | 5.996917 | TGACTCCTCATCCACC-1 | AST_CER (2) | AST_CER (2) | AST_CER_1 (7) | AST_CER (2) | AST_CER |
| TTTCTCACATAAACCT-1-10x_multiome_brain | 32144 | 4.5071 | 26978 | 4.43101 | 10x_multiome_brain | 5.044732 | 4.826781 | 110849 | 67109 | 43740 | ... | 52345 | 30787 | 0.458761 | 4.420944 | TTTCTCACATAAACCT-1 | GC (6) | GC (4) | GC (8) | GC (5) | GC |
| GTCCTCCCACACAATT-1-10x_multiome_brain | 88467 | 4.946781 | 58322 | 4.765832 | 10x_multiome_brain | 5.513115 | 5.242422 | 325923 | 174752 | 151171 | ... | 164707 | 84274 | 0.482249 | 5.832565 | GTCCTCCCACACAATT-1 | AST_CER (2) | AST_CER (2) | AST_CER_1 (7) | AST_CER (2) | AST_CER |
| CTCCGTCCAGTTTGTG-1-10x_multiome_brain | 131129 | 5.117699 | 78551 | 4.895152 | 10x_multiome_brain | 5.608400 | 5.366159 | 405882 | 232359 | 173523 | ... | 226492 | 123971 | 0.533532 | 6.202113 | CTCCGTCCAGTTTGTG-1 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| GGCTCACAGGCCCAGT-1-10x_multiome_brain | 6184 | 3.791269 | 5723 | 3.757624 | 10x_multiome_brain | 4.424326 | 4.237996 | 26566 | 17298 | 9268 | ... | 9255 | 5871 | 0.339403 | 4.609730 | GGCTCACAGGCCCAGT-1 | INH_VIP (8) | INH_VIP (6) | INH_VIP (10) | INH_VIP (8) | INH_VIP |
| AAGCTCCCAGCACCAT-1-10x_multiome_brain | 2189 | 3.340246 | 2049 | 3.311542 | 10x_multiome_brain | 3.995854 | 3.765818 | 9905 | 5832 | 4073 | ... | 3740 | 2063 | 0.353738 | 7.883237 | AAGCTCCCAGCACCAT-1 | NaN | NaN | NaN | NaN | NaN |
| TAGCCTGAGGTGAGAC-1-10x_multiome_brain | 2942 | 3.468643 | 2831 | 3.45194 | 10x_multiome_brain | 3.931610 | 3.678245 | 8543 | 4767 | 3776 | ... | 5258 | 2787 | 0.584644 | 7.596965 | TAGCCTGAGGTGAGAC-1 | MOL_A_1 (0) | MOL_A (1) | MOL_A_2 (0) | MOL_A_2 (1) | MOL_A |
| GCAGGTTGTCCAAATG-1-10x_multiome_brain | 2773 | 3.44295 | 2615 | 3.417472 | 10x_multiome_brain | 3.855519 | 3.606704 | 7170 | 4043 | 3127 | ... | 4794 | 2610 | 0.645560 | 8.818669 | GCAGGTTGTCCAAATG-1 | NaN | NaN | NaN | NaN | NaN |
| GTGCGCAGTGCTTAGA-1-10x_multiome_brain | 5841 | 3.766487 | 5500 | 3.740363 | 10x_multiome_brain | 4.401332 | 4.168380 | 25196 | 14736 | 10460 | ... | 9874 | 5567 | 0.377782 | 4.820033 | GTGCGCAGTGCTTAGA-1 | INH_SST (7) | INH_SST (5) | INH_SST (9) | INH_SST (7) | INH_SST |
2925 rows × 21 columns
Optionally, you can run also scrublet on the fragment count matrix to infer doublets from the scATAC-seq.
import scrublet as scr
scrub = scr.Scrublet(cistopic_obj.fragment_matrix.T, expected_doublet_rate=0.1)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram();
scrub.call_doublets(threshold=0.22)
scrub.plot_histogram();
scrublet = pd.DataFrame([scrub.doublet_scores_obs_, scrub.predicted_doublets_], columns=cistopic_obj.cell_names, index=['Doublet_scores_fragments', 'Predicted_doublets_fragments']).T
Preprocessing...
/opt/venv/lib/python3.8/site-packages/scrublet/helper_functions.py:252: RuntimeWarning: invalid value encountered in sqrt CV_input = np.sqrt(b);
Simulating doublets... Embedding transcriptomes using PCA... Calculating doublet scores... Automatically set threshold at doublet score = 0.65 Detected doublet rate = 0.2% Estimated detectable doublet fraction = 2.1% Overall doublet rate: Expected = 10.0% Estimated = 9.7% Elapsed time: 32.7 seconds Detected doublet rate = 21.6% Estimated detectable doublet fraction = 67.4% Overall doublet rate: Expected = 10.0% Estimated = 32.1%
cistopic_obj.add_cell_data(scrublet)
633 cells are called as doublets.
sum(cistopic_obj.cell_data.Predicted_doublets_fragments == True)
633
# Save with doublets
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject.pkl', 'wb') as f:
pickle.dump(cistopic_obj, f)
We can subset all cells marked as singlets from the cisTopic object.
# Remove doublets
singlets = cistopic_obj.cell_data[cistopic_obj.cell_data.Predicted_doublets_fragments == False].index.tolist()
# Subset cisTopic object
cistopic_obj_noDBL = cistopic_obj.subset(singlets, copy=True)
print(cistopic_obj_noDBL)
CistopicObject from project 10x_multiome_brain with n_cells × n_regions = 2292 × 435822
# Save without doublets
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'wb') as f:
pickle.dump(cistopic_obj_noDBL, f)
We can also take only cells overlapping with high quality cells in the scRNA-seq. In total, there are 2,159 cells.
# Remove cells without rna counterpart
rna_cells = cistopic_obj_noDBL.cell_data.dropna().index.tolist()
# Subset cisTopic object
cistopic_obj_noDBL_wRNA = cistopic_obj_noDBL.subset(rna_cells, copy=True)
print(cistopic_obj_noDBL_wRNA)
CistopicObject from project 10x_multiome_brain with n_cells × n_regions = 1885 × 435817
# Save without doublets
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL_wRNA.pkl', 'wb') as f:
pickle.dump(cistopic_obj_noDBL_wRNA, f)
The next step is to run the LDA models. There are two types of LDA models (with Collapsed Gibbs Sampling) you can run:
runCGSModels().runCGSModelsMallet().In this tutorial we will use the output of runCGSModelsMallet(). For more details in how to run models, see the advanced tutorial on Running LDA models.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
models=run_cgs_models(cistopic_obj,
n_topics=[2,5,10,15,20,25],
n_cpu=6,
n_iter=100,
random_state=555,
alpha=50,
alpha_by_topic=True,
eta=0.1,
eta_by_topic=False,
save_path=None)
2021-01-26 17:52:50,478 INFO resource_spec.py:212 -- Starting Ray with 499.37 GiB memory available for workers and up to 186.26 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2021-01-26 17:52:51,213 WARNING services.py:923 -- Redis failed to start, retrying now.
2021-01-26 17:52:52,152 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
(pid=779) 2021-01-26 17:53:25,832 cisTopic INFO Running model with 2 topics (pid=784) 2021-01-26 17:53:25,831 cisTopic INFO Running model with 10 topics (pid=783) 2021-01-26 17:53:25,831 cisTopic INFO Running model with 15 topics (pid=780) 2021-01-26 17:53:25,833 cisTopic INFO Running model with 5 topics (pid=782) 2021-01-26 17:53:25,831 cisTopic INFO Running model with 20 topics (pid=785) 2021-01-26 17:53:25,829 cisTopic INFO Running model with 25 topics (pid=779) 2021-01-26 18:05:55,065 cisTopic INFO Model with 2 topics done! (pid=780) 2021-01-26 18:10:09,850 cisTopic INFO Model with 5 topics done! (pid=784) 2021-01-26 18:15:52,994 cisTopic INFO Model with 10 topics done! (pid=783) 2021-01-26 18:21:18,347 cisTopic INFO Model with 15 topics done! (pid=782) 2021-01-26 18:26:05,991 cisTopic INFO Model with 20 topics done! (pid=785) 2021-01-26 18:34:25,387 cisTopic INFO Model with 25 topics done!
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/models/10x_multiome_brain_models_100_iter_noDBL.pkl', 'wb') as f:
pickle.dump(models, f)
If you are working on a cluster and want to run several models, we recommend to submit this step as a job. You can use the runModels_lda.py script.
#!/bin/bash -l
## Job will last 6 hours.
#PBS -l walltime=6:00:00
## Job needs 1 nodes and 24 cores per node.
#PBS -l nodes=1:ppn=28
## Job request memory
#PBS -lmem=180gb # or 180gb
## Specify project credits name to use credits for running the job.
#PBS -A lp_symbiosys
## Batch job name.
#PBS -N 10x_multiomics_brain
## Email options.
#PBS -m abe
#PBS -M carmen.bravogonzalezblas@kuleuven.vib.be
module load Python/3.7.4-foss-2018a
cd /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/pycisTopic
python job_template/runModels_lda.py \
-i /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl \
-o /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/models/10x_multiome_brain_models_500_iter_noDBL.pkl \
-nt 2,5,10,15,20,25,30,35,40,45,50 \
-c 11 \
-it 500 \
-a 50 \
-abt True \
-e 0.1 \
-ebt False \
-sp None \
-s 555
Mallet provides a faster implementation of LDA, that allows to run calculations in parallel within a model. This option is strongly recommended with large data sets. To run mallet, you can use runCGSModelsMallet().
# Load functions
from pycisTopic.lda_models import *
# Configure path Mallet
path_to_mallet_binary='mallet'
# Run models
models=run_cgs_models_mallet(path_to_mallet_binary,
cistopic_obj,
n_topics=[2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
n_cpu=12,
n_iter=500,
random_state=555,
alpha=50,
alpha_by_topic=True,
eta=0.1,
eta_by_topic=False,
tmp_path='/scratch/leuven/313/vsc31305/tmp/mallet/tutorial/', #Use SCRATCH if many models or big data set
save_path='/scratch/leuven/313/vsc31305/tmp/mallet/tutorial/')
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/models/10x_multiome_brain_models_500_iter_noDBL_mallet.pkl', 'wb') as f:
pickle.dump(models, f)
There are several methods that can be used for model selection:
For scATAC-seq data models, the most helpful methods are Minmo (topic coherence) and the log-likelihood in the last iteration.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Load models
from pycisTopic.lda_models import *
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic_v1/models/10x_multiome_brain_models_500_iter_noDBL.pkl', 'rb')
models = pickle.load(infile)
infile.close()
model=evaluate_models(models,
select_model=40,
return_model=True,
metrics=['Arun_2010','Cao_Juan_2009', 'Minmo_2011', 'loglikelihood'],
plot_metrics=False,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/models/model_selection.pdf')
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Add model to cisTopicObject
cistopic_obj.add_LDA_model(model)
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'wb') as f:
pickle.dump(cistopic_obj, f)
We can cluster the cells (or regions) using the leiden algorithm, and perfor dimensionality reductiion with UMAP and TSNE. In these examples we will focus on the cells only. For these steps, the cell-topic contriibutions of the model will be used.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
from pycisTopic.clust_vis import *
find_clusters(cistopic_obj,
target = 'cell',
k = 10,
res = [0.6],
prefix = 'pycisTopic_',
scale = True)
2021-08-05 18:04:47,102 cisTopic INFO Finding neighbours
run_umap(cistopic_obj,
target = 'cell', scale=True)
2021-08-05 18:04:49,957 cisTopic INFO Running UMAP
run_tsne(cistopic_obj,
target = 'cell', scale=True)
2021-08-05 18:04:58,582 cisTopic INFO Running FItSNE Symmetrizing... Using the given initialization. Exaggerating Ps by 12.000000 Input similarities computed (sparsity = 0.056813)! Learning embedding... Using FIt-SNE approximation. Iteration 50 (50 iterations in 0.52 seconds), cost 3.057269 Iteration 100 (50 iterations in 0.47 seconds), cost 2.708559 Iteration 150 (50 iterations in 0.48 seconds), cost 2.596058 Iteration 200 (50 iterations in 0.48 seconds), cost 2.544019 Iteration 250 (50 iterations in 0.49 seconds), cost 2.602994 Unexaggerating Ps by 12.000000 Iteration 300 (50 iterations in 0.48 seconds), cost 1.635262 Iteration 350 (50 iterations in 0.48 seconds), cost 1.387551 Iteration 400 (50 iterations in 0.48 seconds), cost 1.315359 Iteration 450 (50 iterations in 0.56 seconds), cost 1.241484 Iteration 500 (50 iterations in 0.77 seconds), cost 1.233023 Iteration 550 (50 iterations in 0.92 seconds), cost 1.207155 Iteration 600 (50 iterations in 1.02 seconds), cost 1.201735 Iteration 650 (50 iterations in 1.09 seconds), cost 1.191511 Iteration 700 (50 iterations in 1.16 seconds), cost 1.194142 Iteration 750 (50 iterations in 1.11 seconds), cost 1.179365 Will use momentum during exaggeration phase Computing input similarities... Using perplexity, so normalizing input data (to prevent numerical problems) Using perplexity, not the manually set kernel width. K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored. Using ANNOY for knn search, with parameters: n_trees 50 and search_k 4500 Going to allocate memory. N: 2292, K: 90, N*K = 206280 Building Annoy tree... Done building tree. Beginning nearest neighbor search... parallel (36 threads): [===========================================================>] 99% 0.047s
The clustering assignments are added to cistopic_obj.cell_data and the projections to cistopic_obj.projections['cell']. If you would like to add additional dimensionality reductions, you can just add them as an entry to the projections dictionary (under 'cell' in this case).
cistopic_obj.cell_data
| cisTopic_nr_frag | cisTopic_log_nr_frag | cisTopic_nr_acc | cisTopic_log_nr_acc | sample_id | Log_total_nr_frag | Log_unique_nr_frag | Total_nr_frag | Unique_nr_frag | Dupl_nr_frag | ... | TSS_enrichment | barcode | leiden_res0.9 | leiden_res0.3 | leiden_res1.2 | leiden_res0.6 | cell_type | Doublet_scores_fragments | Predicted_doublets_fragments | pycisTopic_leiden_10_0.6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CACCTCAGTTGTAAAC-1-10x_multiome_brain | 18291 | 4.262237 | 15553 | 4.191814 | 10x_multiome_brain | 4.680707 | 4.477873 | 47941 | 30052 | 17889 | ... | 6.397778 | CACCTCAGTTGTAAAC-1 | NaN | NaN | NaN | NaN | NaN | 0.04914 | False | 5 |
| TGACTCCTCATCCACC-1-10x_multiome_brain | 100115 | 5.000499 | 62314 | 4.794586 | 10x_multiome_brain | 5.493496 | 5.250254 | 311527 | 177932 | 133595 | ... | 5.996917 | TGACTCCTCATCCACC-1 | AST_CER (2) | AST_CER (2) | AST_CER_1 (7) | AST_CER (2) | AST_CER | 0.114173 | False | 3 |
| TTTCTCACATAAACCT-1-10x_multiome_brain | 32144 | 4.5071 | 26978 | 4.43101 | 10x_multiome_brain | 5.044732 | 4.826781 | 110849 | 67109 | 43740 | ... | 4.420944 | TTTCTCACATAAACCT-1 | GC (6) | GC (4) | GC (8) | GC (5) | GC | 0.028971 | False | 8 |
| GTCCTCCCACACAATT-1-10x_multiome_brain | 88467 | 4.946781 | 58322 | 4.765832 | 10x_multiome_brain | 5.513115 | 5.242422 | 325923 | 174752 | 151171 | ... | 5.832565 | GTCCTCCCACACAATT-1 | AST_CER (2) | AST_CER (2) | AST_CER_1 (7) | AST_CER (2) | AST_CER | 0.089376 | False | 3 |
| CTCCGTCCAGTTTGTG-1-10x_multiome_brain | 131129 | 5.117699 | 78551 | 4.895152 | 10x_multiome_brain | 5.608400 | 5.366159 | 405882 | 232359 | 173523 | ... | 6.202113 | CTCCGTCCAGTTTGTG-1 | NaN | NaN | NaN | NaN | NaN | 0.201183 | False | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| GGCTCACAGGCCCAGT-1-10x_multiome_brain | 6184 | 3.791269 | 5723 | 3.757624 | 10x_multiome_brain | 4.424326 | 4.237996 | 26566 | 17298 | 9268 | ... | 4.609730 | GGCTCACAGGCCCAGT-1 | INH_VIP (8) | INH_VIP (6) | INH_VIP (10) | INH_VIP (8) | INH_VIP | 0.024715 | False | 7 |
| AAGCTCCCAGCACCAT-1-10x_multiome_brain | 2189 | 3.340246 | 2049 | 3.311542 | 10x_multiome_brain | 3.995854 | 3.765818 | 9905 | 5832 | 4073 | ... | 7.883237 | AAGCTCCCAGCACCAT-1 | NaN | NaN | NaN | NaN | NaN | 0.016225 | False | 1 |
| TAGCCTGAGGTGAGAC-1-10x_multiome_brain | 2942 | 3.468643 | 2831 | 3.45194 | 10x_multiome_brain | 3.931610 | 3.678245 | 8543 | 4767 | 3776 | ... | 7.596965 | TAGCCTGAGGTGAGAC-1 | MOL_A_1 (0) | MOL_A (1) | MOL_A_2 (0) | MOL_A_2 (1) | MOL_A | 0.044811 | False | 0 |
| GCAGGTTGTCCAAATG-1-10x_multiome_brain | 2773 | 3.44295 | 2615 | 3.417472 | 10x_multiome_brain | 3.855519 | 3.606704 | 7170 | 4043 | 3127 | ... | 8.818669 | GCAGGTTGTCCAAATG-1 | NaN | NaN | NaN | NaN | NaN | 0.019643 | False | 1 |
| GTGCGCAGTGCTTAGA-1-10x_multiome_brain | 5841 | 3.766487 | 5500 | 3.740363 | 10x_multiome_brain | 4.401332 | 4.168380 | 25196 | 14736 | 10460 | ... | 4.820033 | GTGCGCAGTGCTTAGA-1 | INH_SST (7) | INH_SST (5) | INH_SST (9) | INH_SST (7) | INH_SST | 0.032058 | False | 6 |
2292 rows × 24 columns
We can also visualize metadata (categorical or continuous) on the UMAP/tSNE spaces.
plot_metadata(cistopic_obj,
reduction_name='UMAP',
variables=['cell_type', 'pycisTopic_leiden_10_0.6'], # Labels from RNA and new clusters
target='cell', num_columns=2,
text_size=24,
dot_size=15,
figsize=(20,10),
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/visualization/dimensionality_reduction_label.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:472: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
We can also check other statistics. For example, here we see that many of the non-matching cells with the RNA profiles have a high number of fragments and high doublet scores.
plot_metadata(cistopic_obj,
reduction_name='UMAP',
variables=['Log_unique_nr_frag', 'TSS_enrichment', 'Doublet_scores_fragments', 'FRIP'],
target='cell', num_columns=2,
text_size=15,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/visualization/dimensionality_reduction_number.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:524: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
We can also plot the topic-contributions.
plot_topic(cistopic_obj,
reduction_name = 'UMAP',
target = 'cell',
num_columns=5,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/visualization/dimensionality_reduction_topic_contr.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:648: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
Or we can also draw a heatmap with the topic contributions (and annotations).
cell_topic_heatmap(cistopic_obj,
variables = ['cell_type'],
scale = False,
legend_loc_x = 1.05,
legend_loc_y = -1.2,
legend_dist_y = -1,
figsize=(10,10),
save = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/visualization/heatmap_topic_contr.pdf')
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'wb') as f:
pickle.dump(cistopic_obj, f)
When combining data originating from different statistical distributions, it is probably not a good idea to simply concatenate them together without taking into account their individual distributions as some of them might be of binary, categorical or continuous nature. This is the case when we compare scRNA-seq data (negative binomial) and scATAC-seq data (binary). We can convert these data set into a common non-parametric space where they loose the memory about their technology of origin using graph-based integration. When having been converted into graphs, the individual data sets represent pairwise connections between data points without any “memory” of what statistical process generated the individual data sets. In the graph space, it is straightforward to find an intersection between individual graphs from individual data sets by keeping edges consistently present between the data points across the graphs from individual data sets. The strength of this approach is that one can apply an appropriate distance metric when converting the raw data into graphs, i.e. working with binary data one might want to use the Hamming distance to compute pairwise connections between data points, while working with continuous data it might be sufficient to apply the Euclidean distance. In this section we use this approach, using UMAP to build fuzzy simplicial sets (similar to KNN graphs), that can be used for integrated clustering and dimensionality reduction.
In addition, we will integrate the cell-topic (pycisTopic) and the cell-PCs (Scanpy) matrices rather than the raw omics data.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Load processed scRNA-seq data
import scanpy
import pandas as pd
infile = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/single_sample_scrublet/out/data/intermediate/1.0.0.SC__SCANPY__DIM_REDUCTION_PCA.h5ad'
vsn_obj = scanpy.read_h5ad(infile)
# Format names
rna_cell_names = vsn_obj.obs.index.tolist()
rna_cell_names = [x.replace('-1.0.0', '-1-10x_multiome_brain') for x in rna_cell_names]
rna_pca = pd.DataFrame(vsn_obj.obsm['X_pca'])
rna_pca.index = rna_cell_names
rna_pca.columns = ['PC' + str(x) for x in range(1,vsn_obj.obsm['X_pca'].shape[1]+1)]
With the parameter rna_weight we can also specify if any of the omics layers should have more weight than the other. Since the scATAC-seq layer in this data set is quite sparse, we will apply a 0.8 rna layer weight (0.2 atac layer weight).
from pycisTopic.clust_vis import *
find_clusters(cistopic_obj, k=100, rna_components=rna_pca, rna_weight = 0.8, prefix='RNA+ATAC_', res=[2])
2021-08-05 18:06:15,279 cisTopic INFO Finding neighbours 2021-08-05 18:06:15,870 cisTopic INFO Finding clusters Warning: Some cells in this CistopicObject are not present in this cell_data. Values will be filled with Nan
run_umap(cistopic_obj, rna_components=rna_pca, rna_weight = 0.8, reduction_name='RNA+ATAC_UMAP')
2021-08-05 18:06:27,387 cisTopic INFO Running UMAP
plot_metadata(cistopic_obj,
reduction_name='RNA+ATAC_UMAP',
variables=['cell_type', 'pycisTopic_leiden_10_0.6','RNA+ATAC_leiden_100_2'], # Labels from RNA and new clusters
target='cell', num_columns=3,
text_size=12,
dot_size=15,
figsize=(20,5))
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:472: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
plot_topic(cistopic_obj, reduction_name='RNA+ATAC_UMAP', num_columns=5)
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:648: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'wb') as f:
pickle.dump(cistopic_obj, f)
It is also possible to visualize scATAC-seq profiles from jupyter_lab using igv_jupyter lab. In this example, we will use the pseudobulk bigwigs generated in the begining using the cell_type variable; but you can generate any pseudobulk based on cell_data using the export_pseudobulk function.
First you will need to initialize the browser:
# Temporarilly you will need to use this version of the package until we update jupyterlab to the newest version
import os
os.chdir("/data/leuven/software/biomed/jupyterhub/miniconda/envs/jupyterhub/lib/python3.9/site-packages/")
from igv_jupyterlab.igv_widget import *
# When updated, use this instead
# from igv_jupyterlab import IGV
igv = IGV(genome="hg38")
display(igv)
# It will take some seconds to load, be patient ;)
Now you can load your profiles:
# Load bw paths
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl', 'rb')
bw_paths = pickle.load(infile)
infile.close()
# You will see the files appearing in the browser
import random
prefix='https://lcbhub.aertslab.org/user/u0117456/23/files/' # Adapt to your session
for key in bw_paths.keys():
track = IGV.create_track(
name=key,
url=prefix+bw_paths[key],
color="#" + "%06x" % random.randint(0, 0xFFFFFF),
autoHeight = True,
min= 0,
max=10
)
igv.load_track(track)
# Add consensus regions
track = IGV.create_track(
name='Consensus regions',
url=prefix+'/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/consensus_peak_calling/consensus_regions.bed',
displayMode='Squished'
)
igv.load_track(track)
# To jump to a specific gene. You will see the browser changing. You can also do this directly in the browser.
igv.locus = 'SOX10'
You can save the view by clicking save SVG in the browser.
Next we can binarize topic-region and cell-topic distributions. The first is useful for exploring the topics with other tools that work with region sets (e.g. GREAT, cisTarget); while the latter is useful to automatically automate topics.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
We will first binarize the topic-region distributions. There are several methods that can be used for this: 'otsu' (Otsu, 1979), 'yen' (Yen et al., 1995), 'li' (Li & Lee, 1993), 'aucell' (Van de Sande et al., 2020) or 'ntop' (Taking the top n regions per topic). Otsu and Yen's methods work well in topic-region distributions; however for some downstream analyses it may be convenient to use 'ntop' to have balanced region sets.
from pycisTopic.topic_binarization import *
region_bin_topics = binarize_topics(cistopic_obj, method='otsu', ntop=3000, plot=True, num_columns=5, save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/otsu.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/topic_binarization.py:127: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, j)
Similarly, we can now binarize the cell-topic distribions.
binarized_cell_topic = binarize_topics(cistopic_obj, target='cell', method='li', plot=True, num_columns=5, nbins=100)
/opt/venv/lib/python3.8/site-packages/pycisTopic/topic_binarization.py:127: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, j)
We see that some thresholds are not very accurate. We can adjust the manually.
predifined_thr={'Topic12':0.7, 'Topic13':0.4, 'Topic27':0.5}
binarized_cell_topic = binarize_topics(cistopic_obj, target='cell', method='li', plot=True, num_columns=5, nbins=100, predifined_thr=predifined_thr)
Following, we can compute the topic quality control metrics. These include:
from pycisTopic.topic_qc import *
topic_qc_metrics = compute_topic_metrics(cistopic_obj)
We will create a figure dictionary to put all plots together. In this case, we will not put any threshold to filter topics.
fig_dict={}
fig_dict['CoherenceVSAssignments']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Log10_Assignments', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['AssignmentsVSCells_in_bin']=plot_topic_qc(topic_qc_metrics, var_x='Log10_Assignments', var_y='Cells_in_binarized_topic', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSCells_in_bin']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Cells_in_binarized_topic', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSRegions_in_bin']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Regions_in_binarized_topic', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSMarginal_dist']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Marginal_topic_dist', var_color='Gini_index', plot=False, return_fig=True)
fig_dict['CoherenceVSGini_index']=plot_topic_qc(topic_qc_metrics, var_x='Coherence', var_y='Gini_index', var_color='Gini_index', plot=False, return_fig=True)
# Plot topic stats in one figure
fig=plt.figure(figsize=(40, 43))
i = 1
for fig_ in fig_dict.keys():
plt.subplot(2, 3, i)
img = fig2img(fig_dict[fig_]) #To convert figures to png to plot together, see .utils.py. This converts the figure to png.
plt.imshow(img)
plt.axis('off')
i += 1
plt.subplots_adjust(wspace=0, hspace=-0.70)
fig.savefig('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/Topic_qc.pdf', bbox_inches='tight')
plt.show()
Topic 28 (Astrocytes) is an example of a good topic: it has a high coherence and a high number of assignmnents/marginal topic score (these measurements are correlated), while being cell type specific (with a high gini index). Topic 5 is an example of a bad topic: it has low coherence and a low number of assignments while being general (found in many cells and with and low gini index). In some ocassions, for example topic 29 we observe a low number of assignments and a low coherence, but this can also occur if topics are very specific. Generally, we will consider a topic as bad if it has a low number of assignments while being assigned to many cells and a low coherence.
Next, we can automatically annotate topics, in this case by cell type. Here we calculate the proportion of cells in each group that are assigned to the binarized topic in comparison to the ratio in the whole data set. We will consider a topic as general if the difference between the ration of cells in the whole data set in the binarized topic and the ratio of total cells in the assigned groups is above 0.2. This indicates that the topic is general, and the propotion test may fail if the topic is enriched in both foreground (the group) and background (the whole data set); resulting in a big difference between the ratios.
topic_annot = topic_annotation(cistopic_obj, annot_var='cell_type', binarized_cell_topic=binarized_cell_topic, general_topic_thr = 0.2)
/opt/venv/lib/python3.8/site-packages/statsmodels/stats/weightstats.py:790: RuntimeWarning: divide by zero encountered in double_scalars zstat = value / std
topic_annot
| cell_type | Ratio_cells_in_topic | Ratio_group_in_population | is_general | |
|---|---|---|---|---|
| Topic1 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.171902 | 0.135253 | False |
| Topic2 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.169721 | 0.135253 | False |
| Topic3 | MOL_A, MOL_B | 0.532723 | 0.424084 | False |
| Topic4 | GP, MOL_A, MOL_B, INH_PVALB, GC, PURK, INH_SST... | 0.626527 | 0.559337 | False |
| Topic5 | GP, MOL_A, MOL_B, INH_PVALB, PURK, INH_SST, IN... | 0.588133 | 0.536649 | False |
| Topic6 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.17103 | 0.135253 | False |
| Topic7 | OPC, ASTP, AST_CER, ENDO, AST, MGL | 0.327225 | 0.263089 | False |
| Topic8 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.443717 | 0.36911 | False |
| Topic9 | MOL_A, MOL_B | 0.531414 | 0.424084 | False |
| Topic10 | MOL_A, MOL_B | 0.533159 | 0.424084 | False |
| Topic11 | OPC, MOL_A, MOL_B, ENDO, MGL | 0.730803 | 0.527923 | True |
| Topic12 | OPC, ENDO, MGL | 0.0637 | 0.103839 | False |
| Topic13 | MOL_A, MOL_B | 0.533159 | 0.424084 | False |
| Topic14 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.471204 | 0.398342 | False |
| Topic15 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.166667 | 0.135253 | False |
| Topic16 | GP, INH_PVALB, GC, ENDO, PURK, MGL, INH_SST, I... | 0.222949 | 0.170593 | False |
| Topic17 | OPC, ASTP, AST_CER, ENDO, AST, MGL | 0.312391 | 0.263089 | False |
| Topic18 | MGL | 0.039267 | 0.029232 | False |
| Topic19 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.431501 | 0.36911 | False |
| Topic20 | GP, INH_PVALB, GC, ENDO, PURK, INH_SST, INH_VIP | 0.172775 | 0.141361 | False |
| Topic21 | GP, INH_PVALB, GC, ENDO, PURK, INH_SST, INH_VIP | 0.17452 | 0.141361 | False |
| Topic22 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.431065 | 0.36911 | False |
| Topic23 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.165794 | 0.135253 | False |
| Topic24 | MOL_A, MOL_B | 0.532723 | 0.424084 | False |
| Topic25 | MOL_A, MOL_B | 0.53534 | 0.424084 | False |
| Topic26 | MOL_A, MOL_B | 0.535777 | 0.424084 | False |
| Topic27 | MOL_A, MOL_B, MGL | 0.529232 | 0.453316 | False |
| Topic28 | OPC, ASTP, AST_CER, AST | 0.263525 | 0.227749 | False |
| Topic29 | MGL | 0.041012 | 0.029232 | False |
| Topic30 | OPC, ASTP, AST_CER, ENDO, AST | 0.280977 | 0.233857 | False |
| Topic31 | OPC, ASTP, MOL_A, MOL_B, AST_CER, AST, MGL | 0.830716 | 0.681065 | False |
| Topic32 | OPC, GP, INH_PVALB, GC, ENDO, AST, PURK, MGL, ... | 0.292321 | 0.251745 | False |
| Topic33 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.433246 | 0.36911 | False |
| Topic34 | OPC, GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.236475 | 0.203752 | False |
| Topic35 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.469895 | 0.398342 | False |
| Topic36 | OPC, ASTP, AST_CER, ENDO, AST | 0.280105 | 0.233857 | False |
| Topic37 | MOL_A, MOL_B | 0.532723 | 0.424084 | False |
| Topic38 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.164485 | 0.135253 | False |
| Topic39 | OPC, GP, INH_PVALB, GC, ENDO, PURK, MGL, INH_S... | 0.364311 | 0.239092 | False |
| Topic40 | OPC, ASTP, AST_CER, ENDO, AST | 0.269634 | 0.233857 | False |
We can merge the topic metrics and their annotation in a data frame.
topic_qc_metrics = pd.concat([topic_annot[['cell_type', 'Ratio_cells_in_topic', 'Ratio_group_in_population']], topic_qc_metrics], axis=1)
topic_qc_metrics
| cell_type | Ratio_cells_in_topic | Ratio_group_in_population | Log10_Assignments | Assignments | Regions_in_binarized_topic | Cells_in_binarized_topic | Coherence | Marginal_topic_dist | Gini_index | |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic1 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.171902 | 0.135253 | 5.704676 | 506613 | 12394 | 394 | -1.359823 | 0.015349 | 0.565560 |
| Topic2 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.169721 | 0.135253 | 5.636994 | 433505 | 8915 | 389 | -1.282066 | 0.013145 | 0.558768 |
| Topic3 | MOL_A, MOL_B | 0.532723 | 0.424084 | 5.935460 | 861907 | 3593 | 1221 | -1.086476 | 0.025983 | 0.429690 |
| Topic4 | GP, MOL_A, MOL_B, INH_PVALB, GC, PURK, INH_SST... | 0.626527 | 0.559337 | 5.712850 | 516238 | 6046 | 1436 | -1.403527 | 0.015620 | 0.223495 |
| Topic5 | GP, MOL_A, MOL_B, INH_PVALB, PURK, INH_SST, IN... | 0.588133 | 0.536649 | 5.781202 | 604230 | 3182 | 1348 | -1.293442 | 0.018267 | 0.140524 |
| Topic6 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.17103 | 0.135253 | 5.673577 | 471604 | 11592 | 392 | -1.254359 | 0.014293 | 0.572735 |
| Topic7 | OPC, ASTP, AST_CER, ENDO, AST, MGL | 0.327225 | 0.263089 | 5.875961 | 751556 | 1457 | 750 | -0.792607 | 0.022706 | 0.221335 |
| Topic8 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.443717 | 0.36911 | 5.786005 | 610949 | 4527 | 1017 | -1.077836 | 0.018483 | 0.372627 |
| Topic9 | MOL_A, MOL_B | 0.531414 | 0.424084 | 5.805062 | 638354 | 4963 | 1218 | -1.295017 | 0.019281 | 0.318249 |
| Topic10 | MOL_A, MOL_B | 0.533159 | 0.424084 | 5.942749 | 876494 | 1357 | 1222 | -0.756734 | 0.026440 | 0.249621 |
| Topic11 | OPC, MOL_A, MOL_B, ENDO, MGL | 0.730803 | 0.527923 | 6.087209 | 1222388 | 1462 | 1675 | -0.652962 | 0.036856 | 0.106268 |
| Topic12 | OPC, ENDO, MGL | 0.0637 | 0.103839 | 6.033457 | 1080083 | 1612 | 146 | -0.667082 | 0.032584 | 0.078366 |
| Topic13 | MOL_A, MOL_B | 0.533159 | 0.424084 | 5.987684 | 972040 | 3148 | 1222 | -0.914179 | 0.029295 | 0.384448 |
| Topic14 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.471204 | 0.398342 | 5.815732 | 654232 | 3043 | 1080 | -1.033843 | 0.019788 | 0.386267 |
| Topic15 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.166667 | 0.135253 | 5.716365 | 520433 | 11747 | 382 | -1.268933 | 0.015763 | 0.550851 |
| Topic16 | GP, INH_PVALB, GC, ENDO, PURK, MGL, INH_SST, I... | 0.222949 | 0.170593 | 5.629432 | 426022 | 8674 | 511 | -1.341559 | 0.012914 | 0.523063 |
| Topic17 | OPC, ASTP, AST_CER, ENDO, AST, MGL | 0.312391 | 0.263089 | 5.877697 | 754566 | 4047 | 716 | -0.944157 | 0.022807 | 0.432536 |
| Topic18 | MGL | 0.039267 | 0.029232 | 5.649517 | 446187 | 6652 | 90 | -1.184038 | 0.013515 | 0.404096 |
| Topic19 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.431501 | 0.36911 | 5.773615 | 593765 | 7816 | 989 | -1.183107 | 0.017968 | 0.394716 |
| Topic20 | GP, INH_PVALB, GC, ENDO, PURK, INH_SST, INH_VIP | 0.172775 | 0.141361 | 5.706345 | 508563 | 12459 | 396 | -1.477039 | 0.015408 | 0.563245 |
| Topic21 | GP, INH_PVALB, GC, ENDO, PURK, INH_SST, INH_VIP | 0.17452 | 0.141361 | 5.697476 | 498283 | 13030 | 400 | -1.430159 | 0.015099 | 0.553346 |
| Topic22 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.431065 | 0.36911 | 5.823233 | 665630 | 7150 | 988 | -1.212937 | 0.020134 | 0.418485 |
| Topic23 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.165794 | 0.135253 | 5.751432 | 564198 | 14698 | 380 | -1.241550 | 0.017086 | 0.606629 |
| Topic24 | MOL_A, MOL_B | 0.532723 | 0.424084 | 5.950121 | 891499 | 3912 | 1221 | -1.137307 | 0.026881 | 0.371884 |
| Topic25 | MOL_A, MOL_B | 0.53534 | 0.424084 | 6.080228 | 1202896 | 3792 | 1227 | -0.824187 | 0.036226 | 0.416126 |
| Topic26 | MOL_A, MOL_B | 0.535777 | 0.424084 | 6.169499 | 1477404 | 4118 | 1228 | -0.739712 | 0.044466 | 0.437238 |
| Topic27 | MOL_A, MOL_B, MGL | 0.529232 | 0.453316 | 6.437009 | 2735328 | 3232 | 1213 | -0.548188 | 0.082341 | 0.164777 |
| Topic28 | OPC, ASTP, AST_CER, AST | 0.263525 | 0.227749 | 6.035530 | 1085250 | 7936 | 604 | -0.819805 | 0.032776 | 0.652999 |
| Topic29 | MGL | 0.041012 | 0.029232 | 5.609652 | 407054 | 11535 | 94 | -1.210414 | 0.012333 | 0.630701 |
| Topic30 | OPC, ASTP, AST_CER, ENDO, AST | 0.280977 | 0.233857 | 5.943759 | 878534 | 7097 | 644 | -0.803423 | 0.026548 | 0.620820 |
| Topic31 | OPC, ASTP, MOL_A, MOL_B, AST_CER, AST, MGL | 0.830716 | 0.681065 | 6.275622 | 1886349 | 2627 | 1904 | -0.652039 | 0.056827 | 0.117598 |
| Topic32 | OPC, GP, INH_PVALB, GC, ENDO, AST, PURK, MGL, ... | 0.292321 | 0.251745 | 5.801159 | 632643 | 6702 | 670 | -1.195244 | 0.019134 | 0.300821 |
| Topic33 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.433246 | 0.36911 | 5.879434 | 757590 | 5992 | 993 | -1.057348 | 0.022905 | 0.457853 |
| Topic34 | OPC, GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.236475 | 0.203752 | 5.804833 | 638018 | 10585 | 542 | -1.136387 | 0.019300 | 0.543368 |
| Topic35 | OPC, ASTP, GP, INH_PVALB, GC, AST_CER, ENDO, A... | 0.469895 | 0.398342 | 5.940404 | 871775 | 1856 | 1077 | -0.670262 | 0.026332 | 0.374168 |
| Topic36 | OPC, ASTP, AST_CER, ENDO, AST | 0.280105 | 0.233857 | 5.984187 | 964245 | 7101 | 642 | -0.912877 | 0.029130 | 0.639285 |
| Topic37 | MOL_A, MOL_B | 0.532723 | 0.424084 | 5.934439 | 859882 | 4040 | 1221 | -1.128970 | 0.025936 | 0.349085 |
| Topic38 | GP, INH_PVALB, GC, PURK, INH_SST, INH_VIP | 0.164485 | 0.135253 | 5.764113 | 580916 | 13567 | 377 | -1.184791 | 0.017589 | 0.609332 |
| Topic39 | OPC, GP, INH_PVALB, GC, ENDO, PURK, MGL, INH_S... | 0.364311 | 0.239092 | 5.972132 | 937846 | 3935 | 835 | -1.078419 | 0.028311 | 0.165771 |
| Topic40 | OPC, ASTP, AST_CER, ENDO, AST | 0.269634 | 0.233857 | 6.053848 | 1132004 | 10102 | 618 | -1.003872 | 0.034184 | 0.625704 |
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/Topic_qc_metrics_annot.pkl', 'wb') as f:
pickle.dump(topic_qc_metrics, f)
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/binarized_cell_topic.pkl', 'wb') as f:
pickle.dump(binarized_cell_topic, f)
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/binarized_topic_region.pkl', 'wb') as f:
pickle.dump(region_bin_topics, f)
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
Together with working with regulatory topics, we can also identify differentially accessible regions (DARs) between cell types. First, we will impute the region accessibility exploting the cell-topic and topic-region probabilities. To shrink very low probability values to 0, we use a scale factor (by default: 10^6).
from pycisTopic.diff_features import *
imputed_acc_obj = impute_accessibility(cistopic_obj, selected_cells=None, selected_regions=None, scale_factor=10**6)
2021-08-06 14:30:03,695 cisTopic INFO Imputing drop-outs 2021-08-06 14:30:16,393 cisTopic INFO Removing small values 2021-08-06 14:30:22,665 cisTopic INFO Converting to sparse matrix 2021-08-06 14:31:03,469 cisTopic INFO Scaling 2021-08-06 14:31:05,074 cisTopic INFO Keep non zero rows 2021-08-06 14:31:10,851 cisTopic INFO Create CistopicImputedFeatures object 2021-08-06 14:31:10,856 cisTopic INFO Done!
Next we will log-normalize the imputed data.
normalized_imputed_acc_obj = normalize_scores(imputed_acc_obj, scale_factor=10**4)
2021-08-06 14:34:30,488 cisTopic INFO Normalizing imputed data 2021-08-06 14:35:23,780 cisTopic INFO Done!
Optionally, we can identify highly variable regions. This is not mandatory, but will speed up the hypothesis testing step for identifying DARs.
variable_regions = find_highly_variable_features(normalized_imputed_acc_obj,
min_disp = 0.05,
min_mean = 0.0125,
max_mean = 3,
max_disp = np.inf,
n_bins=20,
n_top_features=None,
plot=True,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/HVR_plot.pdf')
2021-08-06 14:37:21,892 cisTopic INFO Calculating mean and variance
2021-08-06 14:38:58,851 cisTopic INFO Done!
There is a total of 75,341 variable features.
len(variable_regions)
75341
We can now identify differentially accessible regions between groups. By default, this function will perform a Wilcoxon rank-sum test between each group in the specified variable and the rest. Alternatively, specified contrast can be provided as a list with foreground and background groups (e.g. for group 1 versus group 2 and 3, and group 2 versus group 1 and 3: [[['Group_1'], ['Group_2, 'Group_3']], [['Group_2'], ['Group_1, 'Group_3']]]).
markers_dict= find_diff_features(cistopic_obj,
imputed_acc_obj,
variable='cell_type',
var_features=variable_regions,
contrasts=None,
adjpval_thr=0.05,
log2fc_thr=np.log2(1.5),
n_cpu=5,
_temp_dir='/scratch/leuven/313/vsc31305/ray_spill')
2021-08-06 14:42:05,016 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=24880) 2021-08-06 14:42:09,268 cisTopic INFO Computing p-value for GC (pid=24883) 2021-08-06 14:42:09,303 cisTopic INFO Computing p-value for AST_CER (pid=24882) 2021-08-06 14:42:09,281 cisTopic INFO Computing p-value for ENDO (pid=24881) 2021-08-06 14:42:09,358 cisTopic INFO Computing p-value for ASTP (pid=24879) 2021-08-06 14:42:09,308 cisTopic INFO Computing p-value for AST (pid=24879) 2021-08-06 14:43:31,167 cisTopic INFO Computing log2FC for AST (pid=24883) 2021-08-06 14:43:34,287 cisTopic INFO Computing log2FC for AST_CER (pid=24880) 2021-08-06 14:43:34,545 cisTopic INFO Computing log2FC for GC (pid=24881) 2021-08-06 14:43:38,087 cisTopic INFO Computing log2FC for ASTP (pid=24882) 2021-08-06 14:43:40,628 cisTopic INFO Computing log2FC for ENDO (pid=24879) 2021-08-06 14:44:21,633 cisTopic INFO AST done! (pid=24879) 2021-08-06 14:44:21,927 cisTopic INFO Computing p-value for GP (pid=24883) 2021-08-06 14:44:26,051 cisTopic INFO AST_CER done! (pid=24883) 2021-08-06 14:44:26,330 cisTopic INFO Computing p-value for INH_PVALB (pid=24880) 2021-08-06 14:44:26,640 cisTopic INFO GC done! (pid=24880) 2021-08-06 14:44:26,787 cisTopic INFO Computing p-value for INH_SST (pid=24881) 2021-08-06 14:44:32,548 cisTopic INFO ASTP done! (pid=24881) 2021-08-06 14:44:32,844 cisTopic INFO Computing p-value for INH_VIP (pid=24882) 2021-08-06 14:44:35,954 cisTopic INFO ENDO done! (pid=24882) 2021-08-06 14:44:36,129 cisTopic INFO Computing p-value for MGL (pid=24879) 2021-08-06 14:45:42,672 cisTopic INFO Computing log2FC for GP (pid=24883) 2021-08-06 14:45:48,201 cisTopic INFO Computing log2FC for INH_PVALB (pid=24880) 2021-08-06 14:45:50,076 cisTopic INFO Computing log2FC for INH_SST (pid=24881) 2021-08-06 14:45:59,945 cisTopic INFO Computing log2FC for INH_VIP (pid=24882) 2021-08-06 14:46:05,806 cisTopic INFO Computing log2FC for MGL (pid=24879) 2021-08-06 14:46:33,281 cisTopic INFO GP done! (pid=24879) 2021-08-06 14:46:33,441 cisTopic INFO Computing p-value for MOL_A (pid=24883) 2021-08-06 14:46:38,408 cisTopic INFO INH_PVALB done! (pid=24883) 2021-08-06 14:46:38,575 cisTopic INFO Computing p-value for MOL_B (pid=24880) 2021-08-06 14:46:41,454 cisTopic INFO INH_SST done! (pid=24880) 2021-08-06 14:46:41,620 cisTopic INFO Computing p-value for OPC (pid=24881) 2021-08-06 14:46:52,916 cisTopic INFO INH_VIP done! (pid=24881) 2021-08-06 14:46:53,107 cisTopic INFO Computing p-value for PURK (pid=24882) 2021-08-06 14:47:00,547 cisTopic INFO MGL done! (pid=24879) 2021-08-06 14:47:55,329 cisTopic INFO Computing log2FC for MOL_A (pid=24883) 2021-08-06 14:48:02,033 cisTopic INFO Computing log2FC for MOL_B (pid=24880) 2021-08-06 14:48:05,721 cisTopic INFO Computing log2FC for OPC (pid=24881) 2021-08-06 14:48:19,365 cisTopic INFO Computing log2FC for PURK (pid=24879) 2021-08-06 14:48:47,644 cisTopic INFO MOL_A done! (pid=24883) 2021-08-06 14:48:53,878 cisTopic INFO MOL_B done! (pid=24880) 2021-08-06 14:48:58,363 cisTopic INFO OPC done! (pid=24881) 2021-08-06 14:49:12,058 cisTopic INFO PURK done!
We can also plot region accessibility into the cell-topic UMAP. For example, let's check how the best DARs for some cell types look like.
from pycisTopic.clust_vis import *
plot_imputed_features(cistopic_obj,
reduction_name='RNA+ATAC_UMAP',
imputed_data=imputed_acc_obj,
features=[markers_dict[x].index.tolist()[0] for x in ['AST_CER', 'GC', 'INH_SST', 'MGL']],
scale=False,
num_columns=4,
selected_cells = cistopic_obj.projections['cell']['RNA+ATAC_UMAP'].index.tolist(),
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/example_best_DARs.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
How many DARs do we find per cell type?
x = [print(x + ': '+ str(len(markers_dict[x]))) for x in markers_dict.keys()]
AST: 33218 ASTP: 32291 AST_CER: 33092 ENDO: 18622 GC: 16680 GP: 18355 INH_PVALB: 19574 INH_SST: 19959 INH_VIP: 20392 MGL: 10839 MOL_A: 26774 MOL_B: 27554 OPC: 19300 PURK: 20886
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl', 'wb') as f:
pickle.dump(imputed_acc_obj, f)
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/DARs.pkl', 'wb') as f:
pickle.dump(markers_dict, f)
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Load imputed accessibility
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl', 'rb')
imputed_acc_obj = pickle.load(infile)
infile.close()
# Load DARs
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/DARs.pkl', 'rb')
DARs_dict = pickle.load(infile)
infile.close()
After imputing region accessibility, we can infer gene accessibility. Next we need to retrieve gene annotation and chromosome sizes for our genome.
# Get TSS annotations
import pybiomart as pbm
import pyranges as pr
# For mouse
#dataset = pbm.Dataset(name='mmusculus_gene_ensembl', host='http://www.ensembl.org')
# For human
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
# For fly
#dataset = pbm.Dataset(name='dmelanogaster_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'start_position', 'end_position', 'strand', 'external_gene_name', 'transcription_start_site', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = 'chr' + annot['Chromosome/scaffold name'].astype(str)
annot.columns=['Chromosome', 'Start', 'End', 'Strand', 'Gene','Transcription_Start_Site', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
annot.Strand[annot.Strand == 1] = '+'
annot.Strand[annot.Strand == -1] = '-'
pr_annotation = pr.PyRanges(annot.dropna(axis = 0))
# Get chromosome sizes
import pandas as pd
import requests
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
chromsizes=pr.PyRanges(chromsizes)
Now we can infer gene activity. In this function there are several options to evaluate:
use_gene_boundaries), and the minimum and maximum distance it should have (upstream and downstream)distance_weight) and the effect of the distance (decay_rate). gene_size_weight), which by default is dividing the size of each gene by the median gene size in the genome. Alternatively, the user can also use average_scores which will calculate the gene activity as the mean weighted region accessibility of all regions linked to the gene.gini_weight). In this notebook we select the parameters as indicated below. To speed up calculations, it is also possible to only work with regions in topics or DARs.
from pycisTopic.gene_activity import *
gene_act, weigths = get_gene_activity(imputed_acc_obj, # Region-cell probabilities
pr_annotation, # Gene annotation
chromsizes, # Chromosome size
use_gene_boundaries=True, # Whether to use the whole search space or stop when encountering another gene
upstream=[1000, 100000], # Search space upstream. The minimum means that even if there is a gene right next to it
#these bp will be taken (1kbp here)
downstream=[1000,100000], # Search space downstream
distance_weight=True, # Whether to add a distance weight (an exponential function, the weight will decrease with distance)
decay_rate=1, # Exponent for the distance exponential funciton (the higher the faster will be the decrease)
extend_gene_body_upstream=10000, # Number of bp upstream immune to the distance weight (their value will be maximum for
#this weight)
extend_gene_body_downstream=500, # Number of bp downstream immune to the distance weight
gene_size_weight=False, # Whether to add a weights based on the length of the gene
gene_size_scale_factor='median', # Dividend to calculate the gene size weigth. Default is the median value of all genes
#in the genome
remove_promoters=False, # Whether to remove promoters when computing gene activity scores
average_scores=True, # Whether to divide by the total number of region assigned to a gene when calculating the gene
#activity score
scale_factor=1, # Value to multiply for the final gene activity matrix
extend_tss=[10,10], # Space to consider a promoter
gini_weight = True, # Whether to add a gini index weigth. The more unique the region is, the higher this weight will be
return_weights= True, # Whether to return the final weights
project='Gene_activity') # Project name for the gene activity object
2021-08-06 14:52:05,578 cisTopic INFO Calculating gene boundaries 2021-08-06 14:52:17,280 cisTopic INFO Calculating distances 2021-08-06 14:53:29,473 cisTopic INFO Calculating distance weigths 2021-08-06 14:53:30,972 cisTopic INFO Distance weights done 2021-08-06 14:53:30,973 cisTopic INFO Calculating gini weights 2021-08-06 14:55:43,747 cisTopic INFO Getting gene activity scores 2021-08-06 15:07:03,686 cisTopic INFO Creating imputed features object
As we did before for the imputed region accessibility, we can also infer the Differentially Accessible Genes (DAGs).
markers_dict= find_diff_features(cistopic_obj,
gene_act,
variable='cell_type',
var_features=None,
contrasts=None,
adjpval_thr=0.05,
log2fc_thr=np.log2(1.5),
n_cpu=5,
_temp_dir='/scratch/leuven/313/vsc31305/ray_spill')
2021-08-06 15:15:22,980 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=3957) 2021-08-06 15:15:27,019 cisTopic INFO Computing p-value for AST (pid=3956) 2021-08-06 15:15:27,073 cisTopic INFO Computing p-value for ASTP (pid=3958) 2021-08-06 15:15:27,109 cisTopic INFO Computing p-value for ENDO (pid=3955) 2021-08-06 15:15:27,088 cisTopic INFO Computing p-value for AST_CER (pid=3959) 2021-08-06 15:15:27,082 cisTopic INFO Computing p-value for GC (pid=3957) 2021-08-06 15:15:38,546 cisTopic INFO Computing log2FC for AST (pid=3956) 2021-08-06 15:15:38,628 cisTopic INFO Computing log2FC for ASTP (pid=3958) 2021-08-06 15:15:38,600 cisTopic INFO Computing log2FC for ENDO (pid=3955) 2021-08-06 15:15:38,634 cisTopic INFO Computing log2FC for AST_CER (pid=3959) 2021-08-06 15:15:38,779 cisTopic INFO Computing log2FC for GC (pid=3957) 2021-08-06 15:15:41,687 cisTopic INFO AST done! (pid=3957) 2021-08-06 15:15:41,712 cisTopic INFO Computing p-value for INH_PVALB (pid=3958) 2021-08-06 15:15:41,666 cisTopic INFO ENDO done! (pid=3958) 2021-08-06 15:15:41,688 cisTopic INFO Computing p-value for GP (pid=3956) 2021-08-06 15:15:41,739 cisTopic INFO ASTP done! (pid=3956) 2021-08-06 15:15:41,765 cisTopic INFO Computing p-value for INH_SST (pid=3955) 2021-08-06 15:15:41,830 cisTopic INFO AST_CER done! (pid=3955) 2021-08-06 15:15:41,864 cisTopic INFO Computing p-value for INH_VIP (pid=3959) 2021-08-06 15:15:41,997 cisTopic INFO GC done! (pid=3959) 2021-08-06 15:15:42,033 cisTopic INFO Computing p-value for MGL (pid=3957) 2021-08-06 15:15:53,103 cisTopic INFO Computing log2FC for INH_PVALB (pid=3958) 2021-08-06 15:15:53,146 cisTopic INFO Computing log2FC for GP (pid=3956) 2021-08-06 15:15:53,213 cisTopic INFO Computing log2FC for INH_SST (pid=3955) 2021-08-06 15:15:53,362 cisTopic INFO Computing log2FC for INH_VIP (pid=3959) 2021-08-06 15:15:53,686 cisTopic INFO Computing log2FC for MGL (pid=3957) 2021-08-06 15:15:56,074 cisTopic INFO INH_PVALB done! (pid=3957) 2021-08-06 15:15:56,105 cisTopic INFO Computing p-value for MOL_A (pid=3958) 2021-08-06 15:15:56,118 cisTopic INFO GP done! (pid=3958) 2021-08-06 15:15:56,156 cisTopic INFO Computing p-value for MOL_B (pid=3956) 2021-08-06 15:15:56,176 cisTopic INFO INH_SST done! (pid=3956) 2021-08-06 15:15:56,215 cisTopic INFO Computing p-value for OPC (pid=3955) 2021-08-06 15:15:56,413 cisTopic INFO INH_VIP done! (pid=3955) 2021-08-06 15:15:56,448 cisTopic INFO Computing p-value for PURK (pid=3959) 2021-08-06 15:15:56,822 cisTopic INFO MGL done! (pid=3957) 2021-08-06 15:16:07,411 cisTopic INFO Computing log2FC for MOL_A (pid=3956) 2021-08-06 15:16:07,531 cisTopic INFO Computing log2FC for OPC (pid=3958) 2021-08-06 15:16:07,497 cisTopic INFO Computing log2FC for MOL_B (pid=3955) 2021-08-06 15:16:07,815 cisTopic INFO Computing log2FC for PURK (pid=3956) 2021-08-06 15:16:10,649 cisTopic INFO OPC done! (pid=3958) 2021-08-06 15:16:10,627 cisTopic INFO MOL_B done! (pid=3957) 2021-08-06 15:16:10,700 cisTopic INFO MOL_A done!
from pycisTopic.clust_vis import *
plot_imputed_features(cistopic_obj,
reduction_name='RNA+ATAC_UMAP',
imputed_data=gene_act,
features=['PDGFRA', 'OLIG2', 'MBP', 'SOX10', # Olig differentiation
'CTNNA3', 'SLC5A11', 'ENPP6', 'OLIG1', # Olig differentiation
'GAD2', 'VIP', 'SST', 'CTXN3', # Int
'NFIB', 'SOX9', #Ast
'LEF1', #Endo
'SPI1'], #Glia
scale=True,
num_columns=4,
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/example_best_DAGs.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i) /opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:763: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
How many DAGs do we have per cell type?
x = [print(x + ': '+ str(len(markers_dict[x]))) for x in markers_dict.keys()]
AST: 2050 ASTP: 2141 AST_CER: 2590 ENDO: 1425 GC: 3144 GP: 3158 INH_PVALB: 2887 INH_SST: 3244 INH_VIP: 3109 MGL: 3393 MOL_A: 1109 MOL_B: 1331 OPC: 1155 PURK: 3045
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/Gene_activity.pkl', 'wb') as f:
pickle.dump(gene_act, f)
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/DAGs.pkl', 'wb') as f:
pickle.dump(markers_dict, f)
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
Exploting the gene activity scores, we can transfer labels from a reference data set (e.g. scRNA-seq). As an example, we will transfer labels from the scRNA-seq layer of this data set.
# Prepare RNA
from loomxpy.loomxpy import SCopeLoom
import itertools
import anndata
path_to_loom = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/add_clusters_scrublet_as_annot/10x_multiome_brain_SCENIC_SCope_output_wAnnot_noDBL-2.loom'
loom = SCopeLoom.read_loom(path_to_loom)
metadata = get_metadata(loom)
expr_mat = loom.ex_mtx
rna_anndata = anndata.AnnData(X=expr_mat)
rna_anndata.obs = metadata
# Prepare ATAC
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/Gene_activity.pkl', 'rb') #Here I am using pycisTopic gene activity matrix, but could be any :)
gene_act = pickle.load(infile)
infile.close()
atac_anndata = anndata.AnnData(X=gene_act.mtx.T, obs=pd.DataFrame(index=gene_act.cell_names), var=pd.DataFrame(index=gene_act.feature_names))
atac_anndata.obs = cistopic_obj.cell_data
The methods available for label transferring are: ingest' [from scanpy], 'harmony' [Korsunsky et al, 2019], 'bbknn' [Polański et al, 2020], 'scanorama' [Hie et al, 2019] and 'cca'. Except for ingest, these methods return a common coembedding and labels are inferred using the distances between query and refenrence cells as weights.
from pycisTopic.label_transfer import *
label_dict = label_transfer(rna_anndata,
atac_anndata,
labels_to_transfer = ['cell_type'],
variable_genes = True,
methods = ['ingest', 'harmony', 'bbknn', 'scanorama', 'cca'],
return_label_weights = False,
_temp_dir='/scratch/leuven/313/vsc31305/ray_spill')
2021-08-06 15:27:47,646 cisTopic INFO Normalizing RNA data
/opt/venv/lib/python3.8/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Revieved a view of an AnnData. Making a copy. view_to_actual(adata)
2021-08-06 15:28:02,674 cisTopic INFO Processing 1 query sample(s) using 1 cpu(s)
2021-08-06 15:28:07,882 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=10132) 2021-08-06 15:28:12,466 cisTopic INFO Normalizing ATAC data from sample 10x_multiome_brain
(pid=10132) /opt/venv/lib/python3.8/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Revieved a view of an AnnData. Making a copy. (pid=10132) view_to_actual(adata)
(pid=10132) 2021-08-06 15:28:21,129 cisTopic INFO Running integration with ingest (pid=10132) 2021-08-06 15:28:51,791 cisTopic INFO Running integration with harmony (pid=10132) 2021-08-06 15:28:53,264 harmonypy INFO Iteration 1 of 10
(pid=10132) 2021-08-06 15:28:53,264 - harmonypy - INFO - Iteration 1 of 10
(pid=10132) 2021-08-06 15:28:54,191 harmonypy INFO Iteration 2 of 10
(pid=10132) 2021-08-06 15:28:54,191 - harmonypy - INFO - Iteration 2 of 10
(pid=10132) 2021-08-06 15:28:55,171 harmonypy INFO Iteration 3 of 10
(pid=10132) 2021-08-06 15:28:55,171 - harmonypy - INFO - Iteration 3 of 10
(pid=10132) 2021-08-06 15:28:55,793 harmonypy INFO Iteration 4 of 10
(pid=10132) 2021-08-06 15:28:55,793 - harmonypy - INFO - Iteration 4 of 10
(pid=10132) 2021-08-06 15:28:56,247 harmonypy INFO Converged after 4 iterations (pid=10132) 2021-08-06 15:28:56,316 cisTopic INFO Running integration with bbknn
(pid=10132) 2021-08-06 15:28:56,247 - harmonypy - INFO - Converged after 4 iterations
(pid=10132) 2021-08-06 15:29:09,116 cisTopic INFO Running integration with scanorama (pid=10132) Found 1517 genes among all datasets (pid=10132) [[0. 0.26570681] (pid=10132) [0. 0. ]] (pid=10132) Processing datasets (0, 1) (pid=10132) 2021-08-06 15:29:14,574 cisTopic INFO Running integration with cca
We can now add the annotations to our cisTopic object and visualize them in the cell-topic UMAP. Scanorama and harmony are the ones that work best.
label_dict_x=[label_dict[key] for key in label_dict.keys()]
label_pd = pd.concat(label_dict_x, axis=1, sort=False)
label_pd.index = cistopic_obj.cell_names
label_pd.columns = ['pycisTopic_' + x for x in label_pd.columns]
cistopic_obj.add_cell_data(label_pd)
Columns ['pycisTopic_harmony_cell_type', 'pycisTopic_cca_cell_type', 'pycisTopic_ingest_cell_type', 'pycisTopic_scanorama_cell_type', 'pycisTopic_bbknn_cell_type'] will be overwritten
from pycisTopic.clust_vis import *
plot_metadata(cistopic_obj,
reduction_name='RNA+ATAC_UMAP',
variables=['cell_type'] + label_pd.columns.to_list(),
remove_nan=True,
cmap=cm.viridis,
seed=555,
num_columns=3,
color_dictionary={},
save='/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/label_transfer.pdf')
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:472: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later. plt.subplot(num_rows, num_columns, i)
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'wb') as f:
pickle.dump(cistopic_obj, f)
We can now create loom files to further explore the results. There are two types of loom files:
Region accessibility: These loom files include the imputed accessibility as matrix, topics as regulons and cell-topic distributions as AUC matrices. The imputed values, the cistopic object used for the imputation and the cell-topic and topic-region binarized distributions are required. Alternatively, we can also provide different clustering and the marker regions (DARs) per group per clustering.
Gene activity: These loom files contain the gene activity as matrix, scRNA-seq derived regulons and their AUC values based on gene activity. The gene activity values, the cistopic object, and scRNA-seq derived regulons are required. Alternatively, we can also provide different clustering and the marker genes (DAGs) per group per clustering.
In this case we will create 5 loom files: region accessibility and gene activity loom files containing all and only cells overlapping with the scRNA-seq data and a loom file using the actual gene expression values from the multiome.
We will first load the data we need for the region accessibility loom files.
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Load imputed accessibility
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl', 'rb') #Here I am using pycisTopic gene activity matrix, but could be any :)
imputed_acc_obj = pickle.load(infile)
infile.close()
# Load region binarized topics
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/binarized_topic_region.pkl', 'rb') #Here I am using pycisTopic gene activity matrix, but could be any :)
binarized_topic_region = pickle.load(infile)
infile.close()
# Load cell binarized topics
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/binarized_cell_topic.pkl', 'rb') #Here I am using pycisTopic gene activity matrix, but could be any :)
binarized_cell_topic = pickle.load(infile)
infile.close()
# Load DARs
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/DARs.pkl', 'rb')
DARs_dict = pickle.load(infile)
infile.close()
# Subset regions, we will use only regions in topics and DARs here to make it faster
regions_in_topics = list(set(sum([binarized_topic_region[i].index.tolist() for i in binarized_topic_region.keys()],[])))
regions_in_DARs = list(set(sum([DARs_dict[i].index.tolist() for i in DARs_dict.keys()],[])))
selected_regions = list(set(regions_in_topics + regions_in_DARs))
# Prepare DARs dict
cluster_markers = {'cell_type': DARs_dict}
# Export to loom
from pycisTopic.loom import *
export_region_accessibility_to_loom(accessibility_matrix = imputed_acc_obj,
cistopic_obj = cistopic_obj,
binarized_topic_region = binarized_topic_region,
binarized_cell_topic = binarized_cell_topic,
out_fname = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/loom/10x_multiome_brain_pycisTopic_region_accessibility.loom',
selected_regions = selected_regions ,
selected_cells = cistopic_obj.projections['cell']['RNA+ATAC_UMAP'].index.tolist(),
cluster_annotation = ['cell_type'],
cluster_markers = cluster_markers,
tree_structure = ('10x_multiome_brain', 'pycisTopic', 'noDBL_all'),
title = 'Tutorial - Region accessibility all',
nomenclature = "hg38")
2021-08-04 17:54:25,653 cisTopic INFO Creating minimal loom
Regulon name does not seem to be compatible with SCOPE. It should include a space to allow selection of the TF.
Please run:
regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons]
or:
regulons = [r.rename(r.name.replace('(',' (')) for r in regulons]
2021-08-04 19:14:58,815 cisTopic INFO Adding annotations
2021-08-04 19:15:21,880 cisTopic INFO Adding clusterings
2021-08-04 19:15:21,927 cisTopic INFO Adding markers
No markers for nan
2021-08-04 19:15:31,233 cisTopic INFO Exporting
/opt/venv/lib/python3.8/site-packages/loomxpy/loomxpy.py:459: FutureWarning: The default value of regex will change from True to False in a future version.
regulons.columns = regulons.columns.str.replace("_?\\(", "_(")
/opt/venv/lib/python3.8/site-packages/loomxpy/loomxpy.py:437: FutureWarning: The default value of regex will change from True to False in a future version.
regulons.columns = regulons.columns.str.replace("_?\\(", "_(")
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
# Get regulons
path_to_annotated_rna_loom = "/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/add_clusters_scrublet_as_annot/10x_multiome_brain_SCENIC_SCope_output_wAnnot_noDBL.loom"
loom = SCopeLoom.read_loom(path_to_annotated_rna_loom)
regulons = get_regulons(loom)
# Get gene activity
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/Gene_activity.pkl', 'rb') #Here I am using pycisTopic gene activity matrix, but could be any :)
gene_act = pickle.load(infile)
infile.close()
# Get DAGs
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DAGs/DAGs.pkl', 'rb')
DAGs_dict = pickle.load(infile)
infile.close()
cluster_markers = {'cell_type': DAGs_dict}
# Get metadata from high-quality loom file
from pycisTopic.loom import *
cell_data = get_metadata(loom)
selected_cells = list(set(cistopic_obj.cell_names) & set(cell_data.index.to_list()))
# Add RNA embeddings
embeddings_ids = [loom.get_meta_data()['embeddings'][x]['id'] for x in range(len(loom.get_meta_data()['embeddings']))]
embeddings = {loom.get_meta_data()['embeddings'][x]['name']: pd.DataFrame(loom.get_embedding_by_id(embeddings_ids[x]), index=cell_data.index.to_list()) for x in range(len(embeddings_ids))}
rna_cistopic_obj = cistopic_obj
rna_cistopic_obj.projections['cell'].update(embeddings)
# Export
export_gene_activity_to_loom(gene_activity_matrix = gene_act,
cistopic_obj = rna_cistopic_obj,
regulons = regulons,
selected_cells = selected_cells,
out_fname = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/loom/10x_multiome_brain_pycisTopic_gene_activity.loom',
cluster_annotation = ['cell_type'],
cluster_markers = cluster_markers,
tree_structure = ('10x_multiome_brain', 'pycisTopic', 'noDBL_all'),
title = 'Tutorial - Gene activity',
nomenclature = "hg38")
2021-08-06 15:40:50,052 cisTopic INFO Creating minimal loom
Regulon name does not seem to be compatible with SCOPE. It should include a space to allow selection of the TF.
Please run:
regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons]
or:
regulons = [r.rename(r.name.replace('(',' (')) for r in regulons]
2021-08-06 15:41:03,321 ctxcore.recovery WARNING Less than 80% of the genes in HNF1B_(+) are present in the expression matrix.
2021-08-06 15:41:02,967 ctxcore.recovery WARNING Less than 80% of the genes in HIC2_(+) are present in the expression matrix.
2021-08-06 15:41:03,760 ctxcore.recovery WARNING Less than 80% of the genes in LHX4_(+) are present in the expression matrix.
2021-08-06 15:41:04,022 ctxcore.recovery WARNING Less than 80% of the genes in MLX_(+) are present in the expression matrix.
2021-08-06 15:41:04,204 ctxcore.recovery WARNING Less than 80% of the genes in MYBL1_(+) are present in the expression matrix.
2021-08-06 15:41:05,410 ctxcore.recovery WARNING Less than 80% of the genes in PAX3_(+) are present in the expression matrix.
2021-08-06 15:41:09,445 ctxcore.recovery WARNING Less than 80% of the genes in ZNF625_(+) are present in the expression matrix.
2021-08-06 15:45:40,113 cisTopic INFO Adding annotations
2021-08-06 15:45:42,228 cisTopic INFO Adding clusterings
2021-08-06 15:45:42,263 cisTopic INFO Adding markers
2021-08-06 15:45:42,821 cisTopic INFO Exporting
/opt/venv/lib/python3.8/site-packages/loomxpy/loomxpy.py:459: FutureWarning: The default value of regex will change from True to False in a future version.
regulons.columns = regulons.columns.str.replace("_?\\(", "_(")
/opt/venv/lib/python3.8/site-packages/loomxpy/loomxpy.py:437: FutureWarning: The default value of regex will change from True to False in a future version.
regulons.columns = regulons.columns.str.replace("_?\\(", "_(")
# Get DGEM
path_to_annotated_rna_loom = "/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/add_clusters_scrublet_as_annot/10x_multiome_brain_SCENIC_SCope_output_wAnnot_noDBL-2.loom"
loom = SCopeLoom.read_loom(path_to_annotated_rna_loom)
dgem = loom.ex_mtx
# Create loom
export_gene_activity_to_loom(gene_activity_matrix = dgem,
cistopic_obj = rna_cistopic_obj,
regulons = regulons,
selected_cells = selected_cells,
out_fname = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/loom/10x_multiome_brain_pycisTopic_gene_expression.loom',
cluster_annotation = ['cell_type'],
cluster_markers = None,
tree_structure = ('10x_multiome_brain', 'pycisTopic', 'noDBL_all'),
title = 'Tutorial - Gene expression',
nomenclature = "hg38")
2021-08-06 15:49:32,787 cisTopic INFO Creating minimal loom
Regulon name does not seem to be compatible with SCOPE. It should include a space to allow selection of the TF.
Please run:
regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons]
or:
regulons = [r.rename(r.name.replace('(',' (')) for r in regulons]
2021-08-06 15:58:30,061 cisTopic INFO Adding annotations
2021-08-06 15:58:31,194 cisTopic INFO Adding clusterings
2021-08-06 15:58:31,234 cisTopic INFO Exporting
pycisTopic provides a wrapper over GREAT (http://great.stanford.edu/public/html/; similar to rGREAT in R) to perform GO analysis on sets of regions. In this example we will perform this analysis on topic regions.
# Load region binarized topics
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/topic_binarization/binarized_topic_region.pkl', 'rb')
binarized_topic_region = pickle.load(infile)
infile.close()
import pyranges as pr
from pycistarget.utils import *
region_sets = {key: pr.PyRanges(region_names_to_coordinates(binarized_topic_region[key].index.tolist())) for key in binarized_topic_region.keys()}
from pycisTopic.pyGREAT import *
result = pyGREAT(region_sets, 'hg38', n_cpu=5, _temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')
2021-08-06 16:24:15,451 INFO services.py:1245 -- View the Ray dashboard at http://127.0.0.1:8265
/opt/venv/lib/python3.8/site-packages/pycisTopic/clust_vis.py:524: MatplotlibDeprecationWarning: Passing non-integers as three-element position specification is deprecated since 3.3 and will be removed two minor releases later.
plt.subplot(num_rows, num_columns, i)
Let's check topic 29 for example. This is a microglia topic and we find GO terms related to immune response:
result['Topic29']['GO Biological Process'][0:10]
| Ontology | ID | Desc | BinomRank | BinomP | BinomBonfP | BinomFdrQ | RegionFoldEnrich | ExpRegions | ObsRegions | ... | HyperBonfP | HyperFdrQ | GeneFoldEnrich | ExpGenes | ObsGenes | TotalGenes | GeneSetCov | TermCov | Regions | Genes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO Biological Process | GO:0002376 | immune system process | 1 | 2.362995e-306 | 3.109465e-302 | 3.109465e-302 | 1.810696e+0 | 2.056115e+3 | 3723 | ... | 1.413781e-32 | 2.019687e-33 | 1.303983e+0 | 9.271596e+2 | 1209 | 2329 | 1.617391e-1 | 5.191069e-1 | chr10:102241430-102241930,chr10:102503525-1025... | ABI1,ABL1,ABL2,ACAA1,ACE,ACKR2,ACPP,ACTB,ACTN1... |
| 1 | GO Biological Process | GO:0006955 | immune response | 2 | 1.302735e-235 | 1.714269e-231 | 8.571345e-232 | 2.034185e+0 | 1.139031e+3 | 2317 | ... | 1.489236e-22 | 9.928237e-24 | 1.329372e+0 | 5.927611e+2 | 788 | 1489 | 1.054181e-1 | 5.292142e-1 | chr10:102241430-102241930,chr10:102534242-1025... | ABL1,ABL2,ACAA1,ACKR2,ACPP,ACTR2,ADA2,ADAM10,A... |
| 2 | GO Biological Process | GO:0002682 | regulation of immune system process | 3 | 1.121589e-222 | 1.475899e-218 | 4.919663e-219 | 1.917789e+0 | 1.309320e+3 | 2511 | ... | 4.326286e-21 | 1.730514e-22 | 1.327480e+0 | 5.664869e+2 | 752 | 1423 | 1.006020e-1 | 5.284610e-1 | chr10:120714935-120715435,chr10:120789574-1207... | A2M,ABI1,ABL1,ABR,ACE,ACTB,ACTR2,ACTR3,ACVR1B,... |
| 3 | GO Biological Process | GO:0001775 | cell activation | 4 | 7.571003e-191 | 9.962683e-187 | 2.490671e-187 | 2.029020e+0 | 9.516909e+2 | 1931 | ... | 2.305520e-26 | 2.095927e-27 | 1.435768e+0 | 4.032686e+2 | 579 | 1013 | 7.745819e-2 | 5.715696e-1 | chr10:102241430-102241930,chr10:133241621-1332... | ABL1,ACAA1,ACPP,ACTB,ACTR2,ADA2,ADAM10,ADAM17,... |
| 4 | GO Biological Process | GO:0002684 | positive regulation of immune system process | 5 | 9.315792e-177 | 1.225865e-172 | 2.451730e-173 | 2.046779e+0 | 8.633075e+2 | 1767 | ... | 4.157824e-21 | 1.807750e-22 | 1.419811e+0 | 3.570898e+2 | 507 | 897 | 6.782609e-2 | 5.652174e-1 | chr10:120714935-120715435,chr10:120789574-1207... | ABI1,ABL1,ACTB,ACTR2,ACTR3,ACVR1B,ACVR2A,ADAM1... |
| 5 | GO Biological Process | GO:0045321 | leukocyte activation | 6 | 4.751507e-173 | 6.252508e-169 | 1.042085e-169 | 2.088927e+0 | 7.937089e+2 | 1658 | ... | 1.663429e-22 | 9.784878e-24 | 1.439543e+0 | 3.459432e+2 | 498 | 869 | 6.662207e-2 | 5.730725e-1 | chr10:133241621-133242121,chr10:133246863-1332... | ABL1,ACAA1,ACPP,ACTR2,ADA2,ADAM10,ADAM17,ADAM8... |
| 6 | GO Biological Process | GO:0006950 | response to stress | 7 | 3.930702e-166 | 5.172411e-162 | 7.389158e-163 | 1.468141e+0 | 2.829428e+3 | 4154 | ... | 5.096343e-12 | 5.045884e-14 | 1.158781e+0 | 1.301367e+3 | 1508 | 3269 | 2.017391e-1 | 4.613032e-1 | chr10:102241430-102241930,chr10:102534242-1025... | A2M,AACS,ABAT,ABCA1,ABCB4,ABCF1,ABL1,ABL2,ABRA... |
| 7 | GO Biological Process | GO:0050776 | regulation of immune response | 8 | 1.009591e-161 | 1.328521e-157 | 1.660651e-158 | 2.055692e+0 | 7.846505e+2 | 1613 | ... | 1.670593e-13 | 2.012763e-15 | 1.341185e+0 | 3.638574e+2 | 488 | 914 | 6.528428e-2 | 5.339168e-1 | chr10:120714935-120715435,chr10:120789574-1207... | A2M,ABI1,ABL1,ABR,ACTB,ACTR2,ACTR3,ADAM8,ADAR,... |
| 8 | GO Biological Process | GO:0048583 | regulation of response to stimulus | 9 | 3.210231e-157 | 4.224343e-153 | 4.693714e-154 | 1.347329e+0 | 4.015351e+3 | 5410 | ... | 5.057184e-38 | 5.057184e-38 | 1.240145e+0 | 1.546593e+3 | 1918 | 3885 | 2.565886e-1 | 4.936937e-1 | chr10:100449063-100449563,chr10:100999920-1010... | A2M,AAAS,AAK1,ABAT,ABCA1,ABCA7,ABCB4,ABHD6,ABI... |
| 9 | GO Biological Process | GO:0002252 | immune effector process | 10 | 2.684132e-147 | 3.532049e-143 | 3.532049e-144 | 2.069045e+0 | 7.046731e+2 | 1458 | ... | 1.056269e-14 | 1.467040e-16 | 1.356021e+0 | 3.598764e+2 | 488 | 904 | 6.528428e-2 | 5.398230e-1 | chr10:102241430-102241930,chr10:120714935-1207... | ABI1,ABL1,ACAA1,ACE,ACPP,ACTB,ACTR2,ACTR3,ADA2... |
10 rows × 24 columns
In topic 3, an oligodendrocyte topic, we find myelination terms:
result['Topic3']['GO Biological Process'][0:10]
| Ontology | ID | Desc | BinomRank | BinomP | BinomBonfP | BinomFdrQ | RegionFoldEnrich | ExpRegions | ObsRegions | ... | HyperBonfP | HyperFdrQ | GeneFoldEnrich | ExpGenes | ObsGenes | TotalGenes | GeneSetCov | TermCov | Regions | Genes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GO Biological Process | GO:0009987 | cellular process | 1 | 3.767155e-33 | 4.957199e-29 | 4.957199e-29 | 1.073175e+0 | 3.103873e+3 | 3331 | ... | 1.000000e+0 | 2.075332e-2 | 1.025482e+0 | 2.773330e+3 | 2844 | 14773 | 8.068085e-1 | 1.925134e-1 | chr10:100286101-100286601,chr10:100372626-1003... | A3GALT2,AAR2,AATK,ABCD4,ABCE1,ABCG1,ABHD12,ABH... |
| 1 | GO Biological Process | GO:0042552 | myelination | 2 | 7.003502e-28 | 9.215908e-24 | 4.607954e-24 | 3.299413e+0 | 3.576394e+1 | 118 | ... | 7.601440e-9 | 1.652487e-10 | 2.779204e+0 | 1.727113e+1 | 48 | 92 | 1.361702e-2 | 5.217391e-1 | chr10:132632041-132632541,chr10:132648295-1326... | ACSBG1,ADGRG6,AKT2,ANK2,ARHGEF10,ASPA,CD9,CNTN... |
| 2 | GO Biological Process | GO:0008366 | axon ensheathment | 3 | 3.563408e-27 | 4.689089e-23 | 1.563030e-23 | 3.159842e+0 | 3.860953e+1 | 122 | ... | 1.290998e-8 | 2.634689e-10 | 2.718892e+0 | 1.802205e+1 | 49 | 96 | 1.390071e-2 | 5.104167e-1 | chr10:132632041-132632541,chr10:132648295-1326... | ACSBG1,ADGRG6,AKT2,ANK2,ARHGEF10,ASPA,CD9,CLDN... |
| 3 | GO Biological Process | GO:0044699 | single-organism process | 4 | 2.723313e-25 | 3.583608e-21 | 8.959019e-22 | 1.078328e+0 | 2.925826e+3 | 3155 | ... | 1.916391e-4 | 1.878815e-6 | 1.058212e+0 | 2.377596e+3 | 2516 | 12665 | 7.137589e-1 | 1.986577e-1 | chr10:100286101-100286601,chr10:100372626-1003... | A3GALT2,ABCA2,ABCA8,ABCD4,ABCE1,ABCG1,ABHD12,A... |
| 4 | GO Biological Process | GO:0044763 | single-organism cellular process | 5 | 4.475846e-24 | 5.889766e-20 | 1.177953e-20 | 1.104844e+0 | 2.544251e+3 | 2811 | ... | 4.171571e-10 | 1.191878e-11 | 1.107057e+0 | 1.870726e+3 | 2071 | 9965 | 5.875177e-1 | 2.078274e-1 | chr10:100286101-100286601,chr10:100372626-1003... | A3GALT2,ABCD4,ABCE1,ABCG1,ABHD12,ABHD17A,ABHD2... |
| 5 | GO Biological Process | GO:0065007 | biological regulation | 6 | 7.406046e-24 | 9.745616e-20 | 1.624269e-20 | 1.089025e+0 | 2.751085e+3 | 2996 | ... | 3.902162e-4 | 3.422949e-6 | 1.064174e+0 | 2.188552e+3 | 2329 | 11658 | 6.607092e-1 | 1.997770e-1 | chr10:100286101-100286601,chr10:100372626-1003... | ABCA2,ABCE1,ABCG1,ABHD17A,ABHD17B,ABHD17C,ABHD... |
| 6 | GO Biological Process | GO:0050789 | regulation of biological process | 7 | 3.008335e-23 | 3.958668e-19 | 5.655240e-20 | 1.093839e+0 | 2.668582e+3 | 2919 | ... | 7.793830e-5 | 8.034876e-7 | 1.072408e+0 | 2.071972e+3 | 2222 | 11037 | 6.303546e-1 | 2.013228e-1 | chr10:100286101-100286601,chr10:100372626-1003... | ABCA2,ABCE1,ABCG1,ABHD17A,ABHD17B,ABHD17C,ABHD... |
| 7 | GO Biological Process | GO:0048522 | positive regulation of cellular process | 8 | 2.515835e-20 | 3.310587e-16 | 4.138234e-17 | 1.172905e+0 | 1.586659e+3 | 1861 | ... | 2.182189e-14 | 1.983808e-15 | 1.226011e+0 | 9.225036e+2 | 1131 | 4914 | 3.208511e-1 | 2.301587e-1 | chr10:102117795-102118295,chr10:102644882-1026... | ABL1,ACACB,ACER2,ACER3,ACIN1,ACR,ACTN4,ACTR3B,... |
| 8 | GO Biological Process | GO:0050794 | regulation of cellular process | 9 | 4.156232e-20 | 5.469186e-16 | 6.076873e-17 | 1.092275e+0 | 2.589093e+3 | 2828 | ... | 8.986939e-4 | 7.247531e-6 | 1.071259e+0 | 1.966844e+3 | 2107 | 10477 | 5.977305e-1 | 2.011072e-1 | chr10:100286101-100286601,chr10:100372626-1003... | ABCA2,ABCE1,ABCG1,ABHD17B,ABHD2,ABHD6,ABL1,ACA... |
| 9 | GO Biological Process | GO:0019222 | regulation of metabolic process | 10 | 1.103988e-19 | 1.452738e-15 | 1.452738e-16 | 1.149067e+0 | 1.809294e+3 | 2079 | ... | 6.159203e-5 | 6.483371e-7 | 1.123335e+0 | 1.190206e+3 | 1337 | 6340 | 3.792908e-1 | 2.108833e-1 | chr10:100372626-100373126,chr10:102117795-1021... | ABCA2,ABCE1,ABCG1,ABHD6,ABL1,ABTB2,ACACB,ACER2... |
10 rows × 24 columns
We can also retrieve regions linked to a specific GO term:
signatures_dict = {}
signatures_dict['Immune'] = get_region_signature(result, 'Topic29', 'GO Biological Process', 'immune response')
signatures_dict['Myelination'] = get_region_signature(result, 'Topic3', 'GO Biological Process', 'myelination')
# Save
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/pyGREAT/pyGREAT_dict.pkl', 'wb') as f:
pickle.dump(result, f)
Given a set of signatures (e.g. Chip-seq peaks, bulk peaks, ...), pycisTopic allows to calculate their enrichment on cells or topics. In this example, we will use the GO signatures from pyGREAT as example.
# Load imputed accessibility
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl', 'rb') #Here I am using pycisTopic gene activity matrix, but could be any :)
imputed_acc_obj = pickle.load(infile)
infile.close()
# Load cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
from pycisTopic.signature_enrichment import *
signature_scores = signature_enrichment(imputed_acc_obj, signatures_dict, n_cpu=5)
We can add now the enrichment scores to our cisTopic object as metadata and plot them:
cistopic_obj.add_cell_data(signature_scores)
from pycisTopic.clust_vis import *
plot_metadata(cistopic_obj,
reduction_name='UMAP',
variables=['Immune', 'Myelination'],
target='cell', num_columns=2,
text_size=24,
dot_size=15,
figsize=(20,10))
We find the immune response signature enriched in microglia and the myelination signature enriched in oligodendrocytes.